Hanoi University of Science and Technology
School of Information and Communication Technology
Master Thesis in Computer Science
Semi-Supervised Learning for Medical Image Segmentation
PHAM VAN TOAN
toan.pv211049m@sis.hust.edu.vn
Supervisor: Dr. Dinh Viet Sang
Hanoi, October 2023
Author’s Declaration
I hereby declare that I am the sole author of this thesis. The results presented in
this work have not been copied from any other works.
STUDENT
Pham Van Toan
Acknowledgements
I would like to extend my deepest and heartfelt gratitude to the individuals and
groups mentioned below. I am fully aware that without their unwavering support,
guidance, and wholehearted assistance, I would not have been able to complete my
master’s thesis as I stand today.
Since embarking on my journey of pursuing a master’s degree at Hanoi University of
Science and Technology, I have never felt alone for a single moment. The steadfast
support, encouragement, and solid backing from my family, loved ones, and mentors
have empowered me to overcome every challenge and accomplish my master’s thesis.
To my family and beloved relatives, from the very beginning when I decided to
pursue my dream of higher education, they dedicated all their love and care to
guide me. Despite their busy schedules and lives, they have always been there,
providing encouragement and believing that I could conquer all obstacles. Thank
you to everyone for your unwavering trust and support.
I wish to express my sincerest gratitude to Dr. Dinh Viet Sang. There are no words
that can adequately convey my appreciation and gratitude for you, sir. You have not
only imparted valuable knowledge but also provided me with numerous thoughtful
pieces of advice in both work and life. Your knowledge, skills, and dedication have
served as a driving force that helped me overcome the challenges during my research
and thesis work. You have inspired me and guided me to delve deeper into the field
of semi-supervised learning, helping me understand the significance of the work I
was undertaking and its impact on the healthcare community.
To my teachers, colleagues, and friends within the VINIF group, who supported me,
shared knowledge, and assisted me throughout the research and thesis completion
process. The contributions of these mentors and peers not only helped me grasp
expertise but also sparked creative ideas and novel approaches to problem-solving.
This thesis is not merely my personal achievement but also the result of unity,
sharing, and support from everyone in my life and educational journey. I am aware
that the knowledge and experiences I have gathered will not only benefit myself but
also make a positive contribution to society.
Once again, I sincerely thank my family, loved ones, Dr. Dinh Viet Sang, and all
members of the VINIF group. I promise to continue to strive, learn, and contribute
to building a brighter future.
Yours sincerely,
Pham Van Toan
Contents
Contents
Abstract
List of Figures
List of Tables
List of Acronyms
1 Introduction 1
1.1 General introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Related Research and Prior Work 4
2.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Convolutional Neural Network . . . . . . . . . . . . . . . . . . 5
2.2 Polyp Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Semantic Segmentation . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Polyp Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3.1 Feature Pyramid Network (FPN) with DenseNet169
Backbone . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.4 Regularization techniques . . . . . . . . . . . . . . . . . . . . 10
2.2.4.1 Data augmentation . . . . . . . . . . . . . . . . . . . 10
2.2.4.2 Batch normalization . . . . . . . . . . . . . . . . . . 11
2.2.4.3 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.4.4 Weight decay . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Semi Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Mean Teacher . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Consistency Regularization . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Momentum Network . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.4 Pseudo Labeling . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.4.1 Online Pseudo Labeling . . . . . . . . . . . . . . . . 17
2.3.4.2 Offline Pseudo Labeling . . . . . . . . . . . . . . . . 17
2.3.4.3 Leveraging Momentum Networks for Stability . . . . 18
3 Proposed Method 19
3.1 Online Pseudo Labeling with Momentum Network . . . . . . . . . . . 19
3.1.1 Overall Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1.1 Training Teacher model . . . . . . . . . . . . . . . . 21
3.1.1.2 Training Student model . . . . . . . . . . . . . . . . 22
3.1.2 Semi-supervised learning with Online Pseudo Labeling . . . . 23
3.1.2.1 Main algorithm . . . . . . . . . . . . . . . . . . . . . 24
3.1.2.2 Update momentum network . . . . . . . . . . . . . . 24
3.1.2.3 Loss function . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Mixed Momentum Model Committee - M3C Polyp . . . . . . . . . . 26
3.2.1 Overall Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Mixed Momentum Model Committee (M3C) . . . . . . . . . . 28
3.2.2.1 Formation of the M3C . . . . . . . . . . . . . . . . . 28
3.2.2.2 Integration of M3C in Semi-Supervised Learning . . 28
3.2.2.3 Uncertainty Estimation . . . . . . . . . . . . . . . . 29
3.2.2.4 Enhanced Robustness and Accuracy . . . . . . . . . 29
3.2.3 Uncertainty Estimation Based on M3C . . . . . . . . . . . . . 29
3.2.3.1 Monte Carlo Dropout-Inspired Uncertainty Estimation 30
3.2.3.2 Ensemble Prediction with M3C . . . . . . . . . . . . 30
3.2.3.3 Entropy-Based Uncertainty Score . . . . . . . . . . . 30
3.2.3.4 Interpreting Uncertainty Scores . . . . . . . . . . . . 31
3.2.4 Combine Loss for Effective Semi-Supervised Training . . . . . 31
3.2.4.1 Supervised Loss (L_sup) . . . . . . . . . . . . . . . . 31
3.2.4.2 Semi-Supervised Loss (L_semi) . . . . . . . . . . . . . 31
3.2.4.3 Consistency Regularization Loss (L_con) . . . . . . . . 32
3.2.4.4 Combining Losses for Holistic Training . . . . . . . . 32
4 Experiments and Results 33
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.1 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.2 Create labeled and unlabeled data . . . . . . . . . . . . . . . . 35
4.2.3 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.3.1 Weak Augmentation on Labeled Dataset . . . . . . . 36
4.2.3.2 Strong Augmentation on Unlabeled Dataset . . . . . 37
4.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4 Implementation Detail . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4.1 Training teacher model in a supervised manner on labeled data 39
4.4.2 Training Student with Offline Pseudo Labeling (Semi-Supervised) 39
4.4.3 Training Student with Online Pseudo Labeling (Semi-Supervised) 40
4.4.4 System configuration . . . . . . . . . . . . . . . . . . . . . . . 41
4.5 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.1 Quantitative results . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.1.1 Effectiveness of momentum network . . . . . . . . . 41
4.5.1.2 Effectiveness of online pseudo labeling . . . . . . . . 42
4.5.2 Comparison with Different Supervised Methods . . . . . . . . 43
4.5.3 Qualitative results . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5.3.1 Comparison between offline and online pseudo labeling 44
4.5.3.2 Comparison with Different Supervised Methods . . . 45
4.5.4 Generalization of our proposed method . . . . . . . . . . . . . 47
4.5.4.1 In-Domain Data Evaluation . . . . . . . . . . . . . . 47
4.5.4.2 Out-of-Domain Data Evaluation . . . . . . . . . . . 47
5 Conclusion and future work 49
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
References 52
Abstract
In recent years, deep learning techniques have demonstrated impressive results and
achieved significant success in various applications. However, they often require a
large amount of appropriately labeled data to achieve reliable results. Collecting and
annotating large datasets for tasks can be extremely costly, time-consuming, and
error-prone, especially in the case of medical image data. These datasets not only
demand significant effort for labeling but also require domain-specific knowledge
from annotators. Meanwhile, there is a wealth of unlabeled medical image data
readily available in practice, which can be a valuable resource if utilized for training
AI models. One potential method to leverage unlabeled data in AI model training
is semi-supervised learning (SSL).
This thesis focuses on the application of semi-supervised learning in the field of
Medical Image Segmentation (MIS), with the aim of maximizing the utilization
of unlabeled data to improve the medical image segmentation process, specifically
targeting polyp segmentation. Polyp segmentation is crucial in aiding diagnosis and
monitoring of pathologies, helping medical experts gain a better understanding of
the structure and characteristics of polyps and other relevant regions.
This thesis introduces a novel algorithm named Online Pseudo Labeling with
Momentum Network and an extended version named Mixed Momentum
Model Committee (M3C). The algorithm combines semi-supervised learning
techniques with automatic pseudo label generation (pseudo labeling), effectively
harnessing the information from large unlabeled datasets. This
method has significantly improved polyp segmentation capabilities and has the po-
tential to enhance medical applications. Furthermore, the thesis conducts research
and analysis of experimental results to determine the effectiveness of this algorithm
compared to traditional methods. The results demonstrate that the proposed semi-
supervised learning method outperforms fully supervised approaches and exhibits
better generalization on out-of-domain datasets.
Keywords: Semi-supervised learning, Semantic Segmentation, Polyp Segmentation
List of Figures
2.1 A simplified figure illustrating a CNN architecture for polyp segmen-
tation. Source [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Illustrating the Semantic Segmentation Task. Source [2] . . . . . . . . 6
2.3 Polyp Segmentation Pipeline using a Deep Neural Network. The input
is an RGB image; the output is a binary mask in which white pixels
denote the polyp region . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Architecture of the Feature Pyramid Network (FPN) for polyp detection
and segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Examples of data augmentation used in polyp segmentation. Source [3] 11
2.6 Regularization techniques: Dropout. Source [2] . . . . . . . . . . . . 12
2.7 Overview of Mean Teacher algorithm for image classification problem.
Source [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.8 Diagram for pseudo labeling and the consistency regularization on
the unlabeled target samples. Source [5] . . . . . . . . . . . . . . . . 15
2.9 Basic pipeline of pseudo labeling in semi-supervised learning, Source
[6] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1 Overview of online pseudo labeling with momentum network pipeline.
This figure offers a condensed view of the complete semi-supervised
polyp segmentation process. It showcases the main stages: teacher
model training, online pseudo label generation, and student training,
culminating in the segmentation outcome. The visual representation
encapsulates the critical steps leading to enhanced segmentation ac-
curacy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Training the Teacher Model with labeled data using supervised loss
and creating the Mixed Momentum Model Committee. . . . . . . . . 27
3.3 Training the Semi-Supervised Model with Online Pseudo Labeling
and Uncertainty Estimation via Mixed Momentum Model Committee
- M3C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Examples from five polyp segmentation datasets, showing the image
and corresponding ground truth mask from each dataset . . . . . . . 34
4.2 Example of our data separation strategy in different folds . . . . . . . 36
4.3 Weak augmentation in labeled images . . . . . . . . . . . . . . . . . . 37
4.4 Strong augmentation in unlabeled images . . . . . . . . . . . . . . . . 38
4.5 Qualitative result comparison between offline pseudo labeling and on-
line pseudo labeling, with/without using momentum network. . . . . 44
4.6 Comparison with Different Supervised Methods . . . . . . . . . . . . 45
4.7 Effectiveness of the semi-supervised method on the ETIS-LaribPolypDB
dataset, which is out-of-domain with respect to the training data. Each
column represents a different feature map and binary mask of the supervised (using complete
labeled data) and semi-supervised model (using 20% of labeled data
and remaining as unlabeled). (a) and (b) are the input image and
corresponding ground truth, (c) and (d) are the GradCAM’s visu-
alization of the segmentation head, and the output binary mask of
the supervised model, (e) and (f) are the GradCAM’s visualization
of segmentation head and the output binary mask of semi-supervised
model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.8 Performance on in-domain data of our proposed method and different
supervised methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.9 Performance on out-of-domain data of our proposed method and dif-
ferent supervised methods . . . . . . . . . . . . . . . . . . . . . . . . 48
List of Tables
4.1 A comparison of the original teacher with the momentum model
teacher in different ratios of labeled data . . . . . . . . . . . . . . . . 41
4.2 A comparison of the online pseudo labeling and offline pseudo labeling
strategy for semi-supervised training . . . . . . . . . . . . . . . . . . 42
4.3 A comparison of our method with state-of-the-art supervised models . 43
List of Acronyms
AI Artificial Intelligence
ML Machine Learning
DL Deep Learning
NN Neural Network
DNN Deep Neural Network
CNN Convolutional Neural Network
FPN Feature Pyramid Network
SSL Semi-Supervised Learning
CR Consistency Regularization
EMA Exponential Moving Average
PL Pseudo Labeling
MC Dropout Monte Carlo Dropout
M3C Mixed Momentum Model Committee
Chapter 1
Introduction
1.1 General introduction
The accurate segmentation of colon polyps is a critical aspect of medical diagnostics,
particularly in preventing the progression of benign growths into potentially fatal
colon cancer. Colonoscopy, a widely employed technique, serves as a primary means
of detecting colon polyps. However, the application of deep learning has recently
demonstrated remarkable success in automating this process, significantly reducing
the time required for medical practitioners.
Nonetheless, the effectiveness of deep learning models hinges heavily upon the
availability of labeled data, a challenge compounded by the intricacies of medical
datasets. The creation of labeled medical datasets for training semantic segmen-
tation models is labor-intensive and demands annotators with specialized medical
knowledge. To address this challenge, semi-supervised learning has emerged as a
valuable approach. This method leverages extensive unlabeled data to train deep
neural networks for practical tasks, which is particularly pertinent in the medical
image domain, given the high cost and expertise required for data annotation.
Semi-supervised learning utilizes large amounts of unlabeled data when training
deep neural networks on practical tasks. This approach is particularly valuable for
medical image data, where labeling costs are high and considerable effort is required
from many experts. Many studies have applied semi-supervised learning
on medical image data, such as pseudo labeling [7], cross pseudo supervision [8],
few-shot learning [9], deep adversarial learning, and so on. These methods use
both labeled and unlabeled data to train deep learning models for the semantic
segmentation task. Most of them aim to generate high-quality pseudo labels and
have more robust representations of the domain distribution of the data. Previous
works [10] have shown the effectiveness of the momentum network - a slow copy
version of the training model by taking an exponential moving average. Empirical
results show that the momentum network often produces more stable results than
the original model trained directly through the back-propagation process.
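The momentum network mentioned above can be sketched as an exponential moving average (EMA) of the student's weights. In this minimal illustration, plain Python floats stand in for the model's weight tensors, and the momentum value 0.9 is an illustrative choice rather than one reported in this thesis:

```python
def ema_update(student_params, teacher_params, momentum=0.999):
    """One momentum-network step: teacher <- m * teacher + (1 - m) * student.
    Flat lists of floats stand in for the model's weight tensors; the
    default momentum of 0.999 is a common choice, not taken from the thesis."""
    return [momentum * t + (1.0 - momentum) * s
            for s, t in zip(student_params, teacher_params)]

teacher = [1.0, 2.0]
student = [0.0, 0.0]
teacher = ema_update(student, teacher, momentum=0.9)
print(teacher)  # ≈ [0.9, 1.8]
```

Because the teacher moves only a small fraction toward the student at each step, its weights change slowly, which is exactly what makes its predictions more stable than those of the directly trained model.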
In the realm of semi-supervised learning, pseudo label generation methods can be
categorized as online or offline. Online pseudo labeling, as demonstrated in this
thesis, offers simplicity and efficiency, particularly when the pseudo label generation
model is iteratively updated to enhance the quality of pseudo labels after each
training cycle. Our thesis introduces a novel approach that combines online pseudo
labeling with momentum networks to improve the quality of pseudo labels, resulting
in a significant 3% enhancement in Dice Score compared to the offline pseudo labeling
approach. Additionally, our method outperforms supervised models on certain out-
of-domain datasets.
1.2 Objectives
The aim of this thesis is to introduce a novel semi-supervised learning algorithm
for polyp segmentation, aiming to significantly enhance its performance compared
to existing methods. By leveraging the combination of online pseudo labeling and
momentum networks, this proposed algorithm seeks to improve the quality of pseudo
labels generated during each training iteration, ultimately resulting in more accurate
and robust polyp segmentation.
1.3 Main contributions
The main contributions of this study are as follows:
- The proposal of a pioneering training strategy for semi-supervised learning
that fuses online pseudo labeling and momentum networks to advance polyp
segmentation. Our results showcase a substantial 3% improvement in Dice
Score over the offline pseudo labeling method, along with superior performance
compared to supervised models on select out-of-domain datasets.
- A comprehensive analysis of the influence of online pseudo labeling and mo-
mentum networks on our results.
- The introduction of the first study combining online pseudo labeling and mo-
mentum networks for the specific task of polyp segmentation.
1.4 Outline of the thesis
The rest of this thesis is organized as follows:
Chapter 2 Provides an overview of key concepts in deep learning, the semantic seg-
mentation problem, and semi-supervised learning, forming the foundational tech-
niques applied throughout this thesis.
Chapter 3 Elaborates on the proposed method and its application to the Polyp
Segmentation task, emphasizing the details of how it enhances polyp segmentation
performance.
Chapter 4 Presents experimental results and evaluations of the proposed method,
showcasing its effectiveness and performance compared to existing approaches.
Chapter 5 Concludes the thesis, summarizing the key findings, contributions, and
potential future directions in the field of medical image segmentation and semi-
supervised learning.
Chapter 2
Related Research and Prior Work
2.1 Deep Learning
2.1.1 Neural Network
Artificial neural networks (ANNs) represent a class of machine learning frameworks
that draw inspiration from the intricate neural networks found in biological brains
[11]. ANNs are designed with the objective of “learning” to process complex data
inputs and execute tasks autonomously, relying solely on examples provided during
the training phase, mirroring human cognitive processes.
An ANN [12] consists of a network of interconnected units or nodes, referred to
as artificial neurons, as they emulate the behavior of neurons in biological neural
networks. These artificial neurons are linked, akin to synapses in biological brains,
enabling the transmission of signals between them. In typical ANN implementations,
input signals are represented as real numbers, and each neuron computes an output
value through the application of a non-linear function to the sum of its inputs,
weighted and biased accordingly. The learnable parameters in ANNs encompass
these weights and biases, and an optimization process is employed to find local
minima (ideally, global minima) based on a specified loss function that generalizes
across input data.
Typically, neurons are organized into layers, and the synaptic functions may vary
from one layer to another. Within an ANN, information flows from the initial
input layer to the ultimate output layer, traversing through various intermediate
neuron layers. Following this journey, a loss function is computed. In most cases,
this loss function is differentiable concerning the network weights, facilitating the
optimization of all parameters through a backpropagation algorithm.
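The per-neuron computation described above can be sketched in a few lines; the weights, bias, and choice of sigmoid activation below are illustrative, not parameters from this thesis:

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: the weighted sum of its inputs plus a
    bias, passed through a non-linear activation (here, the sigmoid)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)

# Two real-valued input signals with illustrative weights and bias.
print(neuron([0.5, -1.0], weights=[0.8, 0.2], bias=0.1))
```

Stacking many such neurons into layers, and feeding each layer's outputs into the next, yields exactly the layered information flow described above.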
2.1.2 Convolutional Neural Network
Convolutional Neural Networks (CNNs) [13] are a fundamental component in the
realm of image segmentation tasks. A typical CNN architecture consists of multiple
layers, including convolutional layers, pooling layers, and fully connected layers.
The core operation of a convolutional layer is the convolution operation (∗), which
applies a set of learnable filters (W) to the input image (I) to generate feature maps
(F). This operation can be mathematically expressed as:

F(x, y) = (W * I)(x, y) = \sum_{i=1}^{k} \sum_{j=1}^{k} W(i, j) \cdot I(x - i, y - j)
where k represents the size of the convolutional kernel. The feature maps produced
by convolutional layers capture hierarchical patterns and local features in the in-
put image. Subsequent pooling layers reduce the spatial dimensions of the feature
maps, helping to maintain translation invariance and reduce computational com-
plexity. These transformed features are then fed into fully connected layers for
classification or segmentation tasks. CNNs excel in learning hierarchical represen-
tations and have been instrumental in achieving state-of-the-art results in medical
image segmentation tasks, including the segmentation of medical conditions such as
polyps. Figure 2.1 illustrates a typical CNN architecture for polyp segmentation.
Figure 2.1: A simplified figure illustrating a CNN architecture for polyp segmen-
tation. Source [1]
2.2 Polyp Segmentation
2.2.1 Semantic Segmentation
Semantic segmentation is a crucial task in computer vision, aiming to assign a
specific class label to each pixel in an input image. This technique provides fine-
grained information about the objects present in the image, enabling applications
in autonomous driving, medical imaging, and object recognition.
With an input RGB (or grayscale) image, our objective is to generate an output
segmentation map in which each pixel is assigned a class label represented as an
integer value. Achieving this goal through supervised learning entails utilizing a
dataset consisting of images paired with their respective masks, as illustrated in
Figure. 2.2
Figure 2.2: Illustrating the Semantic Segmentation Task. Source [2]
In semantic segmentation, each pixel p in an input image is associated with a class
label L, indicating the category of the object it belongs to. Formally, given an
image I, semantic segmentation seeks to produce an output label map M, where
M(p) represents the predicted class label for pixel p. The label map M is generated
by applying a segmentation model F to the input image I:
M = F (I)
In this equation, F is typically a convolutional neural network (CNN) with learnable
parameters that learns to map the input image to the corresponding label map. The
model is trained using labeled training data, where both input images and their
corresponding pixel-wise class labels are provided. The training process aims to
minimize a loss function L that measures the dissimilarity between the predicted
label map M and the ground truth label map M_{gt}:

L = \sum_{p} \ell(M(p), M_{gt}(p))

Here, \ell is a per-pixel loss function, such as cross-entropy loss, used to penalize the
misclassification of pixels. The model is optimized to minimize this loss, resulting
in accurate pixel-level class predictions. Semantic segmentation has numerous
applications, including object detection, scene understanding, and medical image
analysis.
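The training objective above can be sketched directly. This minimal NumPy version uses cross-entropy as the per-pixel loss and averages over pixels; the averaging convention and the toy inputs are illustrative choices:

```python
import numpy as np

def pixelwise_ce(probs, gt):
    """Per-pixel cross-entropy: sums -log probs[gt(p)](p) over pixels p,
    then averages. probs has shape (C, H, W) and holds softmax
    probabilities; gt has shape (H, W) and holds integer class labels."""
    eps = 1e-12                        # guards against log(0)
    H, W = gt.shape
    total = 0.0
    for y in range(H):
        for x in range(W):
            total += -np.log(probs[gt[y, x], y, x] + eps)
    return total / (H * W)

# Model assigns class 0 probability 0.9 at every pixel; ground truth is class 0.
probs = np.full((2, 2, 2), 0.1)
probs[0] = 0.9
gt = np.zeros((2, 2), dtype=int)
print(round(pixelwise_ce(probs, gt), 4))  # 0.1054, i.e. -log(0.9)
```

Since the prediction at every pixel is correct with probability 0.9, the average loss reduces to -log(0.9); perfectly confident correct predictions would drive it toward zero.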
2.2.2 Polyp Segmentation
Polyp segmentation is a vital task in medical imaging, particularly within the field
of gastrointestinal endoscopy. Its primary objective is to precisely identify and de-
lineate the boundaries of polyps, abnormal tissue growths, within endoscopic images
of the gastrointestinal tract. This task holds immense clinical significance as it en-
ables early detection of colorectal cancer, one of the leading causes of cancer-related
deaths worldwide. By identifying and removing polyps during routine colonoscopy
screenings, the risk of cancer development can be significantly reduced, leading
to improved patient outcomes, reduced healthcare costs, and enhanced endoscopic
procedures. Moreover, it aligns with the data-driven medicine trend, harnessing
the power of artificial intelligence to analyze extensive endoscopic data and provide
valuable insights for healthcare professionals, ultimately advancing GI disease diag-
nosis and treatment. The polyp segmentation problem can be effectively addressed
using a deep neural network, as illustrated in Figure 2.3.
Figure 2.3: Polyp Segmentation Pipeline using a Deep Neural Network. The input is
an RGB image; the output is a binary mask in which white pixels denote the polyp region
Polyp segmentation has been approached using various methods, including super-
vised and semi-supervised techniques. Supervised methods have traditionally been
employed for polyp segmentation, relying on labeled data for training. These meth-
ods utilize annotated images to train models, such as convolutional neural networks
(CNNs) or Transformers, to accurately segment polyps. Well-known supervised
methods for the polyp segmentation task include Unet [14], Unet++ [15], PraNet [16],
MSNET [17], HarDNet-MSEG [18], and ColonFormer [19]. While effective, their
reliance on labeled data limits their applicability, as obtaining large-scale annotated
datasets can be challenging and time-consuming.
In recent years, semi-supervised methods have gained attention to overcome the
limitations of supervised approaches. These techniques leverage both labeled and
unlabeled data to improve segmentation accuracy. Semi-supervised methods often
incorporate additional information, such as spatial or contextual cues, to enhance
polyp segmentation. Notable approaches include consistency regularization [20, 21],
where models are encouraged to produce consistent predictions under different per-
turbations, and pseudo-labeling [22, 23], which involves iteratively assigning labels
to unlabeled data using a pre-trained model.
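The pseudo-labeling idea can be made concrete for binary polyp segmentation: a teacher's per-pixel probabilities are thresholded into hard labels, and only confident pixels are kept for training the student. The 0.95 confidence threshold below is an illustrative hyperparameter, not a value from this thesis:

```python
import numpy as np

def pseudo_label(teacher_probs, threshold=0.95):
    """Turn a teacher's per-pixel foreground probabilities into hard
    pseudo labels plus a confidence mask. Pixels where the teacher is
    unsure (confidence below the threshold) get mask 0 and are excluded
    from the student's unsupervised loss. The 0.95 threshold is an
    illustrative choice."""
    labels = (teacher_probs >= 0.5).astype(np.int64)        # hard binary label
    confidence = np.maximum(teacher_probs, 1 - teacher_probs)
    keep = (confidence >= threshold).astype(np.float32)     # 1 = trusted pixel
    return labels, keep

probs = np.array([[0.99, 0.60], [0.02, 0.45]])
labels, keep = pseudo_label(probs)
print(labels.tolist(), keep.tolist())  # [[1, 1], [0, 0]] [[1.0, 0.0], [1.0, 0.0]]
```

Only the two pixels where the teacher is at least 95% sure (probability 0.99 and 0.02) survive the mask; the ambiguous pixels (0.60 and 0.45) are ignored, which is what keeps noisy pseudo labels from dominating training.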
In this thesis, we delve into the details of polyp segmentation and its relevance in
medical imaging. We explore the evolution of both supervised and semi-supervised
approaches in this domain, highlighting their strengths and limitations. Further-
more, we examine the key techniques employed in these methods to achieve accu-
rate and robust polyp segmentation. By understanding these techniques, we pave
the way for the introduction of our novel approach, Online Pseudo Labeling
with Momentum Network, which aims to address the challenges posed by polyp
segmentation using semi-supervised learning.
2.2.3 Architectures
2.2.3.1 Feature Pyramid Network (FPN) with DenseNet169 Backbone
In this thesis, we employ the Feature Pyramid Network (FPN) [24] with a DenseNet169
[25] backbone as the primary architecture for our semantic segmentation task. This
combination is pivotal in addressing the challenges of polyp segmentation in medical
images, offering a robust and efficient solution.
DenseNet169, part of the DenseNet [25] family of convolutional neural networks,
has established itself as a formidable choice for various computer vision tasks. Its
architecture, featuring dense connections between layers, enhances gradient flow and
enables the model to effectively capture intricate image details. The DenseNet169
backbone, with its 169 layers, excels at learning hierarchical features from input
images. The densely connected blocks ensure that features from earlier layers are
seamlessly integrated with those from later layers. This design trait allows the
network to preserve fine-grained information while learning high-level abstractions,
making it well-suited for semantic segmentation.
The Feature Pyramid Network (FPN) [24, 26] is a neural network architecture de-
signed to address the challenge of object detection and semantic segmentation at
multiple scales within an image. FPN accomplishes this by creating a feature pyra-
mid with multiple levels, where each level contains features at different spatial res-
olutions. This hierarchy of features enables the network to detect objects of various
sizes effectively. FPN is particularly beneficial in scenarios where objects exhibit
a wide range of scales, which is often the case in medical image analysis, includ-
ing polyp segmentation. By providing a multi-scale feature representation, FPN
helps the network capture both local details and global context, improving the seg-
mentation accuracy. Figure 2.4 illustrates the architectural framework of FPN for
addressing segmentation challenges.
Figure 2.4: Architecture of the Feature Pyramid Network (FPN) for polyp detection
and segmentation
In this thesis, we integrate the DenseNet169 backbone with the FPN architecture to
exploit the strengths of both components. DenseNet169 acts as the feature extractor,
processing input images and extracting features at multiple scales. These features
are then incorporated into the FPN’s pyramid structure, where they are fused and
utilized to perform semantic segmentation.
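The core of the FPN's top-down pathway can be sketched as repeated upsample-and-add merging of backbone features. This toy version omits the 1x1 lateral convolutions and 3x3 smoothing convolutions of a real FPN and assumes all maps already share one channel count; the input shapes are illustrative:

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(features):
    """Core FPN top-down merge: start from the coarsest backbone feature
    and repeatedly upsample it 2x, adding the next finer lateral feature.
    A real FPN also applies 1x1 lateral convs and 3x3 smoothing convs."""
    features = sorted(features, key=lambda f: f.shape[-1])  # coarsest first
    p = features[0]
    pyramid = [p]
    for lateral in features[1:]:
        p = upsample2x(p) + lateral
        pyramid.append(p)
    return pyramid

# Illustrative backbone outputs at three spatial resolutions (1 channel).
c3, c4, c5 = np.ones((1, 8, 8)), np.ones((1, 4, 4)), np.ones((1, 2, 2))
print([p.shape for p in fpn_top_down([c3, c4, c5])])  # [(1, 2, 2), (1, 4, 4), (1, 8, 8)]
```

Each pyramid level thus mixes coarse semantic context from deeper layers with the finer spatial detail of shallower ones, which is why FPN handles polyps of widely varying sizes.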
Our selection of this architecture is guided by its established performance in various
computer vision applications and its adaptability to scenarios with limited anno-
tated data, such as medical image analysis. The combination of DenseNet’s dense
connectivity and FPN’s multi-scale feature representation allows our model to excel
in capturing fine-grained details and global context, which are vital for precise polyp
segmentation.
In the subsequent sections, we delve into the specific adaptations and enhancements
we apply to this architecture to tailor it to the challenging task of polyp segmen-
tation. Additionally, we discuss the incorporation of semi-supervised learning tech-
niques, which significantly contribute to improving segmentation performance and
robustness in medical image analysis.
2.2.4 Regularization techniques
In the realm of machine learning, achieving the ability to perform effectively on
previously unobserved inputs, often referred to as generalization, is paramount. To
tackle this challenge, we delve into the realm of regularization techniques, which play
a pivotal role in enhancing the generalization capabilities of machine learning mod-
els. These techniques are indispensable for preventing overfitting, a common pitfall
where models become overly tailored to the training data, resulting in poor perfor-
mance on unseen data. Regularization serves as a safeguard against this by imposing
constraints on the model’s learning process, steering it away from over-reliance on
idiosyncrasies within the training data. Through this exploration of regularization
techniques, we unlock the potential to bolster the robustness, reliability, and adapt-
ability of machine learning models, ensuring their utility in real-world applications
where generalization is the key to success.
2.2.4.1 Data augmentation
Data augmentation emerges as a potent weapon against the notorious adversary of
generalization known as overfitting. Overfitting occurs when a model becomes too
accustomed to the idiosyncrasies of the training data, failing to extend its knowledge
effectively to unseen inputs. Data augmentation steps in as a savior by introducing
subtle modifications to the original dataset, thereby generating diverse versions of
the same data. While these variants stem from a common source, the model remains
oblivious to this fact, treating each input as novel information. This proliferation
of data, coupled with its inherent variability, imbues the model with a broader
perspective and prevents it from fixating on specific data points. Instead, it fosters
a more holistic and adaptable learning process, enhancing the model’s ability to
generalize effectively to a wide array of real-world scenarios.
In the context of polyp segmentation, the challenge often lies in obtaining a suffi-
ciently diverse and large dataset of annotated medical images. Limited availability
of labeled data can hinder the training of deep learning models and potentially lead
to overfitting on the existing samples. Data augmentation emerges as a powerful
technique to address this issue. By applying various transformations and pertur-
bations to the available polyp images, data augmentation generates an augmented
dataset with increased diversity. These modified images offer different perspectives
of polyps, helping the segmentation model generalize better to unseen cases. In
essence, data augmentation enriches the training data, enabling the model to learn
a broader range of polyp appearances and improving its ability to accurately segment
polyps in real-world medical images.
Figure 2.5: Examples of data augmentation used in polyp segmentation. Source [3]
Figure 2.5 illustrates the data augmentation techniques applied to the original polyp
image (Figure 2.5.a). Our model employs a range of augmentation methods, includ-
ing vertical flipping (Figure 2.5.b), horizontal flipping (Figure 2.5.c), random rota-
tion within -45 to 45 degrees (Figure 2.5.d), random scaling from 0.5 to 1.5 (Figure
2.5.e), random shearing within -16 to 16 degrees (Figure 2.5.f), random Gaussian
blurring with a sigma of 3.0 (Figure 2.5.g), random contrast normalization between
0.5 and 1.5 (Figure 2.5.h), random brightness adjustments spanning from 0.8 to
1.5 (Figure 2.5.i), as well as random cropping and padding by 0–25% of the im-
age’s height and width (Figure 2.5.j). These augmentation techniques enhance the
diversity of the training data and assist the model in learning to segment polyps
effectively under various conditions.
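The transformations above can be sketched as a joint image–mask augmentation routine. The following is a minimal NumPy illustration covering only flips and brightness; our actual pipeline uses a full augmentation library for rotation, shearing, blurring, and the other operations listed, and all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, mask):
    """Apply a random subset of the augmentations described above.

    Simplified sketch: flips and brightness only. Geometric transforms are
    applied to image and mask jointly; photometric ones to the image only.
    """
    # Horizontal flip with probability 0.5 (image and mask together).
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]
    # Vertical flip with probability 0.5.
    if rng.random() < 0.5:
        image, mask = image[::-1, :], mask[::-1, :]
    # Random brightness factor in [0.8, 1.5]; the mask keeps its labels.
    factor = rng.uniform(0.8, 1.5)
    image = np.clip(image * factor, 0.0, 1.0)
    return image, mask

img = rng.random((64, 64))
msk = (rng.random((64, 64)) > 0.5).astype(np.float32)
aug_img, aug_msk = augment(img, msk)
```

Note that the segmentation mask must undergo exactly the same geometric transformations as the image, while intensity changes such as brightness apply to the image alone.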
2.2.4.2 Batch normalization
Batch normalization is a fundamental technique in deep learning, involving the
normalization of each layer’s inputs using the mean and variance within the current
mini-batch. This operation offers several advantages, including faster training, the
ability to employ higher learning rates, improved weight initialization, enhanced
viability of activation functions, enabling the construction of deeper models, and
typically yielding superior results.
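The normalization step can be sketched as follows. This is a simplified training-time forward pass (omitting the running statistics used at inference), with illustrative names:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch per feature, then scale and shift.

    x: (batch, features). gamma and beta are the learnable scale and shift.
    """
    mean = x.mean(axis=0)                 # per-feature mean over the mini-batch
    var = x.var(axis=0)                   # per-feature variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# After normalization, each feature has (approximately) zero mean and unit variance.
```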
2.2.4.3 Dropout
Dropout is a regularization technique widely used in deep learning to prevent over-
fitting and improve the generalization of neural networks. It works by randomly
deactivating a fraction of neurons during each forward and backward pass of training
(Figure 2.6). This process introduces uncertainty and noise into the network,
making it more robust and preventing it from relying too heavily on specific neurons.
Mathematically, dropout is implemented by multiplying the activations of each neuron by a binary mask during training. The binary mask has a value of 1 with probability p (the keep probability, i.e., one minus the dropout rate) and 0 with probability 1 − p. This process can be represented as:

output = (input · mask) / p

Figure 2.6: Regularization techniques: Dropout. Source [2]

Because the activations are already divided by p during training (so-called inverted dropout), dropout is simply turned off during inference or testing with no additional scaling needed, ensuring consistent behavior. Dropout can be applied to various layers in a
neural network, including fully connected layers, convolutional layers, and recurrent
layers. It helps prevent co-adaptation of neurons and encourages the network to
learn more robust and generalized features. One of the key advantages of dropout
is its simplicity and effectiveness in improving model performance. It has become a
standard technique in training deep neural networks and is often used in conjunc-
tion with other regularization methods to achieve state-of-the-art results in various
machine learning tasks.
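A minimal sketch of inverted dropout, in which activations are divided by the keep probability at training time so that inference needs no rescaling (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, keep_prob, training=True):
    """Inverted dropout: keep each activation with probability keep_prob
    and scale the survivors by 1/keep_prob, preserving the expected value."""
    if not training:
        return x                               # dropout is disabled at test time
    mask = (rng.random(x.shape) < keep_prob)   # 1 with probability keep_prob
    return x * mask / keep_prob

x = np.ones((1000,))
y = dropout(x, keep_prob=0.8)
# E[y] equals x, so the expected activation scale is preserved during training.
```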
2.2.4.4 Weight decay
Weight decay is a regularization technique employed in machine learning to curb
model complexity, promoting robustness and enhanced generalization. To achieve
this, it encourages the model to maintain small parameter values. One naive approach to penalize complexity would be to add the raw values of all parameters (weights) to the loss function. However, this can lead to issues because some parameters are positive while others are negative, allowing them to cancel out. A better approach involves adding the sum of squared parameters to the loss function, scaled by a hyper-parameter known as the weight decay coefficient.
Mathematically, weight decay is applied as follows:

Loss = Original Loss + λ · Σ_i w_i^2

Where:
Loss represents the modified loss function.
Original Loss corresponds to the original loss without regularization.
λ is the weight decay coefficient, controlling the strength of regularization.
w_i signifies the individual model parameters.
By incorporating weight decay, we strike a balance between discouraging overly
complex models and avoiding the extreme scenario where all parameters become
zero. This regularization technique is especially valuable in deep learning to improve
model generalization and mitigate overfitting.
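The penalty can be sketched as a stand-alone function (an illustrative simplification; in practice weight decay is usually built into the optimizer):

```python
import numpy as np

def l2_regularized_loss(original_loss, weights, lam):
    """Add the weight-decay penalty lam * sum_i w_i^2 to the task loss."""
    penalty = sum(np.sum(w ** 2) for w in weights)
    return original_loss + lam * penalty

# Toy example: two parameter tensors and a task loss of 1.0.
weights = [np.array([1.0, -2.0]), np.array([[0.5]])]
loss = l2_regularized_loss(original_loss=1.0, weights=weights, lam=0.01)
# penalty = 1 + 4 + 0.25 = 5.25, so loss = 1.0 + 0.01 * 5.25 = 1.0525
```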
2.3 Semi-Supervised Learning
One of the disadvantages of supervised learning is the requirement of massive amounts of labeled data. This problem becomes even more difficult with medical image data such as polyp segmentation, because annotation requires effort from people with experience and expertise in imaging diagnostics. Semi-supervised learning addresses this problem by combining labeled and unlabelled data during model training. The general objective of semi-supervised learning is to improve a base model trained on a dataset of N labelled samples D_sup = {(x_n, y_n) | n = 1, ..., N} by utilizing a dataset of M unlabeled samples D_unsup = {x_m | m = N + 1, ..., N + M}. It leverages additional information, such as spatial and contextual cues, to enhance the model’s segmentation accuracy. Three prominent techniques within the realm of semi-supervised learning are mean teacher, consistency regularization, and pseudo-labeling [27].
Mean Teacher introduces the concept of a teacher model that provides con-
sistent guidance to the student model during training. The teacher model is
an exponential moving average of the student’s weights, ensuring stability and
robustness.
Consistency Regularization encourages models to produce consistent pre-
dictions under various data perturbations, fostering improved generalization.
By enforcing consistency between predictions on different versions of the same
input, the model becomes more resilient to variations.
Pseudo-labeling involves iteratively assigning labels to unlabeled data us-
ing a pre-trained model. This iterative approach refines the pseudo-labels,
allowing the model to learn from the unlabeled dataset progressively.
In this section, we delve into these semi-supervised techniques, illustrating their
significance and application in medical image segmentation. By harnessing the po-
tential of both labeled and unlabeled data, semi-supervised learning paves the way
for more accurate and efficient segmentation models, ultimately contributing to ad-
vancements in medical diagnosis and treatment.
2.3.1 Mean Teacher
The Mean Teacher [4] algorithm is a powerful semi-supervised learning technique
that plays a crucial role in enhancing the robustness and performance of medical
image segmentation models. This method introduces the concept of a “teacher”
model, which guides the training process of a “student” model (Figure 2.7).
During training, the Mean Teacher algorithm maintains a moving average of the
student model’s weights, creating a teacher model with more stable and generalized
knowledge. This teacher model serves as a source of consistency in the training
process. The objective of Mean Teacher is to minimize the discrepancy between the
predictions of the student and teacher models, effectively enforcing consistency in
their outputs.
Figure 2.7: Overview of Mean Teacher algorithm for image classification problem.
Source [4]
Mathematically, the loss function for Mean Teacher can be expressed as follows:

L_MT = E[ KL(P_teacher ∥ P_student) ]

Where:
L_MT represents the Mean Teacher loss.
E denotes the expectation.
KL represents the Kullback-Leibler divergence.
P_teacher and P_student denote the probability distributions of predictions made by the teacher and student models, respectively.
The Mean Teacher algorithm encourages the student model to produce predictions
that are consistent with those of the teacher model. This consistency regularization
enhances the model’s ability to generalize to unseen data, making it particularly
beneficial for medical image segmentation tasks where labeled data is scarce. By
leveraging the strengths of both models, Mean Teacher contributes significantly to
the development of accurate and robust segmentation models in medical imaging,
ultimately aiding in better disease diagnosis and treatment planning.
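The consistency term above can be sketched numerically. The following illustrative snippet computes KL(P_teacher ∥ P_student) from raw logits, a simplification of the full per-pixel segmentation setting:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax over the class axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mean_teacher_loss(teacher_logits, student_logits, eps=1e-8):
    """KL(P_teacher || P_student), averaged over the batch."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)
    return kl.mean()

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 3))            # teacher logits for a batch of 4 samples
loss_same = mean_teacher_loss(t, t)    # identical predictions -> zero divergence
loss_diff = mean_teacher_loss(t, t + rng.normal(size=(4, 3)))
```

The loss vanishes exactly when teacher and student agree and grows as their predictive distributions drift apart, which is what drives the student toward the teacher's more stable outputs.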
2.3.2 Consistency Regularization
Consistency regularization is a fundamental technique in semi-supervised learning
that aims to improve the generalization of deep neural networks, especially in sce-
narios with limited labeled data, such as medical image segmentation. This method
encourages models to produce consistent predictions when presented with slightly
perturbed inputs. The rationale behind consistency regularization is that if the
model’s predictions remain stable under small input variations, it is more likely to
make accurate predictions on unseen data.
The key idea behind consistency regularization is to penalize the divergence between
the predictions made on the original data and the predictions made on augmented or
perturbed versions of the same data (Figure 2.8). This perturbation can take various forms, such as introducing random noise, applying geometric transformations,
or using dropout.
Figure 2.8: Diagram for pseudo labeling and the consistency regularization on the
unlabeled target samples. Source [5]
Mathematically, the consistency loss can be defined as follows:

L_consistency = E[ KL(P(f(x)) ∥ P(f(x′))) ]

Where:
L_consistency represents the consistency loss.
E denotes the expectation.
KL represents the Kullback-Leibler divergence.
P(f(x)) and P(f(x′)) denote the probability distributions of model predictions for the original and perturbed inputs x and x′, respectively.

In this context, f(x) represents the output of the segmentation model for the original input x. The consistency regularization loss encourages f(x) and f(x′) to be close in terms of their probability distributions.
Consistency regularization has proven highly effective in medical image segmen-
tation tasks. By encouraging models to provide consistent predictions for similar
inputs, it helps reduce overfitting and enhances the model’s ability to generalize to
unseen patient data. This regularization technique, when combined with labeled and
unlabeled medical images, contributes to the development of accurate and robust
segmentation models, improving the quality of medical diagnoses and treatment
planning.
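As an illustration, the snippet below measures prediction disagreement under additive input noise, using a mean-squared-error variant of the consistency objective and a toy stand-in model; both choices are assumptions for the sketch, not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x, w):
    """A stand-in 'segmentation model': per-pixel sigmoid of a linear map."""
    return 1.0 / (1.0 + np.exp(-(w * x)))

def consistency_loss(x, w, noise_std=0.05):
    """Mean squared disagreement between predictions on x and a noisy x'.

    A simple L2 variant of the consistency objective; the KL form in the
    text behaves analogously.
    """
    x_perturbed = x + rng.normal(0.0, noise_std, size=x.shape)
    return np.mean((model(x, w) - model(x_perturbed, w)) ** 2)

x = rng.normal(size=(8, 8))            # a toy 8x8 "image"
loss = consistency_loss(x, w=2.0)
# A small perturbation yields a small but nonzero disagreement to penalize.
```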
2.3.3 Momentum Network
A momentum network, a concept rooted in Exponential Moving Average (EMA),
plays a crucial role in enhancing the stability and generalization of deep learning
models, particularly in the context of semi-supervised learning and medical image
segmentation. In the training process, a momentum network is essentially a slow-
copy version of the weights from the original model. This mechanism introduces
a smoothing effect on the model’s weight updates, contributing to a more stable
convergence. The EMA update equation is given as:

θ′_t = α · θ′_{t−1} + (1 − α) · θ_t

Here, θ′_t and θ_t represent the weights of the momentum and original models, respectively, at the t-th step of training. The parameter α controls the weight given to the previous momentum model’s weights, determining the degree of smoothing applied. This approach effectively transforms the momentum network into an ensemble of the original model at different training time steps.
Empirical evidence from studies, such as the work by Araslanov et al. [10], demon-
strates that the incorporation of a momentum network results in significantly im-
proved training stability and accuracy. This stability is particularly valuable in
the context of medical image segmentation, where robust and reliable models are
essential. In essence, the momentum network acts as a stabilizing force in the train-
ing process, mitigating the risk of divergence and allowing the model to converge
more smoothly. This increased stability translates into better segmentation results,
making it a vital component of the methodology employed in this thesis.
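The EMA update can be sketched directly. The toy run below tracks a single fixed weight; the dictionary-of-weights representation is an illustrative simplification of a real parameter set:

```python
def ema_update(momentum_weights, model_weights, alpha):
    """theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t, per parameter."""
    return {name: alpha * momentum_weights[name] + (1.0 - alpha) * model_weights[name]
            for name in momentum_weights}

momentum = {"w": 0.0}
for step in range(3):          # the online model stays fixed at w = 1.0 here
    momentum = ema_update(momentum, {"w": 1.0}, alpha=0.9)
# momentum["w"] approaches 1.0 geometrically: 0.1, 0.19, 0.271, ...
```

A larger α makes the momentum network change more slowly, averaging over a longer history of the online model's weights.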
2.3.4 Pseudo Labeling
Pseudo-label generation methods, a fundamental component of semi-supervised learn-
ing, can be categorized into two main types: online generation and offline generation.
Each of these methods presents distinct advantages and considerations in the context
of enhancing the quality of pseudo labels used in training. Figure 2.9 demonstrates
the basic pipeline of pseudo labeling in semi-supervised learning.
Figure 2.9: Basic pipeline of pseudo labeling in semi-supervised learning. Source [6]
2.3.4.1 Online Pseudo Labeling
In the online pseudo labeling approach, pseudo labels are generated directly dur-
ing the forwarding process of the model [28]. This method offers the advantage
of simplicity in implementation. It seamlessly integrates into the training pipeline,
requiring only a single-stage training with the student model. However, this con-
venience comes with a requirement for stable pseudo label quality throughout the
label generation process. Ensuring a consistent and smooth pseudo-label generation
model becomes paramount for the effectiveness of the online method.
2.3.4.2 Offline Pseudo Labeling
Conversely, the offline pseudo labeling approach generates pseudo labels once at
the outset of each semi-supervised training, with subsequent iterations preserving
17
these labels [29, 30]. This method exhibits the advantage of producing pseudo labels
with consistent quality throughout training. As the number of training iterations
increases, the quality of these labels tends to improve. However, the offline approach
necessitates more complex setup and computational resources compared to its online
counterpart.
2.3.4.3 Leveraging Momentum Networks for Stability
To address the challenge of maintaining stable pseudo label quality during the online
label generation process, this thesis employs a momentum network as a stabilizing
mechanism. The momentum network acts as a slow-copy version of the weights of
the original model, implementing Exponential Moving Average (EMA) to smooth
weight updates [10]. This stabilization ensures that the pseudo label generation
process remains consistent, enhancing the robustness and performance of the semi-
supervised learning approach. The incorporation of momentum networks represents
a critical strategy to bridge the advantages of both online and offline pseudo labeling
methods in the context of medical image segmentation.
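As a sketch, online pseudo label generation for a binary segmentation task can be as simple as thresholding the momentum teacher's per-pixel probabilities; the threshold of 0.5 is an illustrative choice, not a value prescribed by this thesis:

```python
import numpy as np

def generate_pseudo_labels(teacher_probs, threshold=0.5):
    """Binarize the momentum teacher's sigmoid outputs into pseudo masks.

    teacher_probs: (batch, H, W) array of per-pixel polyp probabilities.
    """
    return (teacher_probs >= threshold).astype(np.float32)

probs = np.array([[[0.9, 0.2], [0.6, 0.4]]])   # one tiny 2x2 "image"
pseudo = generate_pseudo_labels(probs)
# -> [[[1., 0.], [1., 0.]]]
```

Because the thresholding is applied to the momentum teacher rather than the raw student, the resulting masks inherit the smoothed, more stable behavior of the EMA weights.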
Chapter 3
Proposed Method
3.1 Online Pseudo Labeling with Momentum Network
In this section, we introduce our novel approach to polyp segmentation, which lever-
ages cutting-edge techniques in deep learning and semi-supervised learning. Our
proposed method represents a comprehensive framework designed to enhance the
accuracy and efficiency of polyp segmentation in medical images. We break down
our approach into three key components:
Overall Pipeline: At the heart of our method lies an intricate pipeline that
orchestrates the entire process of polyp segmentation. We will provide a de-
tailed overview of the pipeline, outlining each step and the rationale behind
it. This high-level perspective will allow readers to grasp the broader context
of our approach.
Semi-Supervised Learning with Online Pseudo Labeling A pivotal el-
ement of our approach is the incorporation of semi-supervised learning tech-
niques. Here, we delve into the specifics of our semi-supervised learning strat-
egy, focusing on our unique contribution: Online Pseudo Labeling. We will
elucidate how this technique enables the integration of unlabeled data effec-
tively and efficiently, a crucial factor in improving the model’s performance.
Training Strategy and Implementation Detail To bring our approach to
life, the devil is in the details. We will provide a comprehensive insight into the
intricacies of our training strategy, shedding light on hyperparameters, data
preprocessing, and model architecture. By understanding these implementa-
tion nuances, readers will be equipped to replicate our methodology and adapt
it to their specific tasks.
Our proposed method represents a significant advancement in the field of polyp
segmentation, offering a powerful toolkit for medical image analysis. With a solid
understanding of our approach’s components and inner workings, readers will be
well-prepared to explore its potential in their own research endeavors.
3.1.1 Overall Pipeline
The core of our approach to polyp segmentation lies in the intricately designed
overall pipeline. Figure 3.1 illustrates the architecture of our system, providing an
overview of the comprehensive process we employ to achieve highly accurate polyp
segmentation in medical images.
Our methodology embraces a two-stage strategy for semi-supervised training. In
the initial stage, the teacher model undergoes training using the labeled dataset.
Throughout this training process, we save both the original model and a continuously updated slow-copy version, realized through Exponential Moving
Average (EMA), known as the Momentum Teacher Network. These dual models
play pivotal roles in facilitating the training of the student network.
Figure 3.1: Overview of online pseudo labeling with momentum network pipeline.
This figure offers a condensed view of the complete semi-supervised polyp segmen-
tation process. It showcases the main stages: teacher model training, online pseudo
label generation, and student training, culminating in the segmentation outcome.
The visual representation encapsulates the critical steps leading to enhanced seg-
mentation accuracy.
The critical component in our approach is the online pseudo label generation, an
operation executed during the student network’s training. Here, we employ the Mo-
mentum Teacher Network, a crucial element in ensuring the pseudo labels maintain
stability throughout the process. Concurrently, the original student model under-
goes weight updates that synchronize with the Momentum Teacher Network and its
corresponding student momentum version.
As we traverse the various steps in our pipeline, each will be expounded upon in the
forthcoming sections. Our overall pipeline encapsulates a sophisticated yet effective
framework that orchestrates the journey from labeled data to the precise segmen-
tation of polyps in medical images. In the following subsections, we delve into the
specifics of our novel semi-supervised learning strategy, Online Pseudo Labeling, and
provide detailed insights into the training strategy and implementation, collectively
empowering readers to fully comprehend and harness the potential of our approach.
3.1.1.1 Training Teacher model
In the initial phase of our semi-supervised learning strategy, we focus on training
the teacher model, leveraging the entire labeled dataset. This critical step sets the
foundation for the subsequent stages and plays a pivotal role in guiding the student
network towards improved polyp segmentation.
During the teacher training process, we adopt the Tversky loss [31] function, a
powerful tool for optimizing model performance in image segmentation tasks. The
Tversky loss function is especially well-suited for scenarios where the dataset might
exhibit class imbalance, a common challenge in medical image analysis. Its formu-
lation involves precision and recall parameters, offering the flexibility to fine-tune
the model’s behavior according to specific requirements. The Tversky loss function
can be expressed as:

Tversky Loss = 1 − TP / (TP + α · FP + β · FN)

Here, TP represents true positives, FP stands for false positives, and FN corresponds to false negatives; the fraction being subtracted is the Tversky index, so minimizing the loss maximizes the index. The hyperparameters α and β allow us to control the balance between precision and recall, tailoring the model’s focus based on the task’s objectives.
Our primary model architecture for teacher training is the Feature Pyramid Network
(FPN) with the DenseNet169 backbone. This choice is driven by the outstanding
performance of FPN in handling multi-scale features, a crucial aspect in medical
image segmentation. The DenseNet169 backbone further enhances the model’s ca-
pacity to extract meaningful features from the input data, contributing to improved
segmentation accuracy.
Following teacher training, we select the best-performing momentum teacher net-
work based on its validation set performance. This carefully chosen network will
later assume a central role in generating pseudo labels during the student train-
ing phase, a key component of our semi-supervised learning strategy. The process
of teacher training not only establishes a strong foundation for our approach but
also ensures that the subsequent stages of training are guided by a highly capable
mentor, ultimately leading to enhanced polyp segmentation results.
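The soft (per-pixel) form of this loss can be sketched as follows; the values alpha = 0.3 and beta = 0.7 are illustrative defaults, not the settings used in this thesis:

```python
import numpy as np

def tversky_loss(pred, target, alpha=0.3, beta=0.7, eps=1e-7):
    """Soft Tversky loss: 1 - TP / (TP + alpha*FP + beta*FN).

    pred: per-pixel probabilities in [0, 1]; target: binary ground truth.
    alpha and beta trade off false positives against false negatives.
    """
    tp = np.sum(pred * target)            # soft true positives
    fp = np.sum(pred * (1.0 - target))    # soft false positives
    fn = np.sum((1.0 - pred) * target)    # soft false negatives
    return 1.0 - tp / (tp + alpha * fp + beta * fn + eps)

target = np.array([1.0, 1.0, 0.0, 0.0])
perfect = tversky_loss(target, target)      # exact prediction -> loss near 0
worst = tversky_loss(1.0 - target, target)  # inverted prediction -> loss of 1
```

Setting beta larger than alpha penalizes missed polyp pixels more heavily than spurious ones, which is often the preferred trade-off when false negatives are clinically costly.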
3.1.1.2 Training Student model
The student training phase is a pivotal step in our approach, building upon the momentum teacher model trained on the labeled dataset, D_sup. Here, we harness the momentum teacher’s power to generate pseudo labels for the unlabeled dataset, D_unsup. A unique aspect of our method is that we continuously update the weights of the momentum teacher model during training, ensuring stable and constantly improving pseudo labels. The steps in this phase include:

Pseudo Label Generation (Step 1): At each epoch, we employ the trained momentum teacher model, denoted as MT_t, to generate pseudo labels for the images in the unlabeled dataset, D_unsup.

Student Training with Pseudo Labels (Step 2): The student model, S_t, is then trained using a combination of labeled data (D_sup) and unlabeled data (D_unsup) with the generated pseudo labels.

Momentum Teacher and Student Updates (Steps 3 and 4): Crucially, we update both the momentum teacher (MT_t) and momentum student (MS_t) models’ weights based on the current student model (S_t) using the Exponential Moving Average (EMA). This continual refinement helps maintain the quality and stability of pseudo labels over time.

Validation and Model Saving (Step 5): Throughout training, we track the performance of the models on a validation set, saving the best-performing models for use in the final segmentation.
This dynamic process leverages the strengths of both teacher and student models to
enhance the segmentation quality progressively. The stability and adaptability of
the pseudo labels, combined with the semi-supervised approach, contribute to the
success of our polyp segmentation framework.
3.1.2 Semi-supervised learning with Online Pseudo Labeling
Semi-supervised learning with online pseudo labeling constitutes a cornerstone of our
approach to polyp segmentation. Its primary purpose is to leverage both labeled (D_sup) and unlabeled (D_unsup) data to improve the accuracy of polyp segmentation
models. This technique addresses the challenges of limited labeled data in medical
imaging.
Advantages of Online Pseudo Labeling:
Efficient Data Utilization: Online pseudo labeling allows us to maximize
the utilization of unlabeled data. While labeled datasets are often scarce
and expensive to acquire, medical images are often readily available in abun-
dance. Online pseudo labeling taps into this resource by dynamically gener-
ating pseudo labels during the student training process.
Dynamic Labeling: One key advantage is the dynamic nature of pseudo
label generation. Pseudo labels evolve with the model during training, contin-
uously adapting to the model’s improvements. This adaptability helps ensure
that the generated labels remain relevant and of high quality.
Reduction of Annotation Costs: By relying on a small set of labeled
data to bootstrap the learning process, this method significantly reduces the
annotation cost and time required for creating large-scale labeled datasets. It
eases the burden on medical experts, who would otherwise need to annotate
vast amounts of data.
Improved Generalization: With a combination of labeled and pseudo-
labeled data, our model benefits from the richer training set, enhancing its
generalization capabilities. The model learns to generalize not only from the
carefully labeled data but also from the diverse patterns present in the unla-
beled images.
Enhanced Segmentation Quality: By iteratively refining the pseudo labels
using a momentum network, our approach produces high-quality labels that
effectively guide the training process. This results in improved segmentation
accuracy compared to traditional supervised methods.
Semi-supervised learning with online pseudo labeling is a crucial component of our
strategy, allowing us to harness the untapped potential of unlabeled medical images
while minimizing the need for costly data annotation. Its adaptability, efficiency,
and ability to enhance model generalization make it an indispensable technique in
the context of polyp segmentation.
3.1.2.1 Main algorithm
The algorithmic details are laid out in Algorithm 1, summarizing the process as
follows:
Algorithm 1: Online Pseudo Labeling with Momentum Teacher Algorithm
Input: Labeled images D_sup, unlabeled images D_unsup, trained momentum teacher MT_0
Output: Best momentum student MS_best trained with a combination of supervised and unsupervised data
Function OnlineLabelingMomentum():
    for t = 0 to n_epochs do
        Step 1: Generate pseudo labels of D_unsup with the MT_t model;
        Step 2: Train student S_t with a combination of D_sup and D_unsup with generated pseudo labels;
        Step 3: Update momentum teacher MT_t with S_t weights via EMA;
        Step 4: Update momentum student MS_t with S_t weights via EMA;
        Step 5: Save the best models on the validation set;
    end
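Algorithm 1 can be sketched as a plain training skeleton. All callables and the dictionary-of-weights representation below are illustrative stand-ins for the real training, labeling, and evaluation routines, and the toy run at the end simply nudges a single weight upward:

```python
def ema(avg, new, alpha):
    """Exponential moving average of two weight dictionaries."""
    return {k: alpha * avg[k] + (1.0 - alpha) * new[k] for k in avg}

def online_pseudo_labeling(train_student, generate_labels, evaluate,
                           teacher, student, n_epochs,
                           alpha_teacher=0.99, alpha_student=0.99):
    """Skeleton of Algorithm 1; heavy lifting is delegated to the callables."""
    momentum_student = dict(student)
    best, best_score = dict(momentum_student), float("-inf")
    for t in range(n_epochs):
        pseudo = generate_labels(teacher)                                 # Step 1
        student = train_student(student, pseudo)                          # Step 2
        teacher = ema(teacher, student, alpha_teacher)                    # Step 3
        momentum_student = ema(momentum_student, student, alpha_student)  # Step 4
        score = evaluate(momentum_student)                                # Step 5
        if score > best_score:
            best, best_score = dict(momentum_student), score
    return best

# Toy run: "training" nudges the single weight toward 1.0 each epoch.
result = online_pseudo_labeling(
    train_student=lambda s, _: {"w": s["w"] + 0.1},
    generate_labels=lambda t: None,
    evaluate=lambda m: m["w"],
    teacher={"w": 0.0}, student={"w": 0.0}, n_epochs=5)
```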
3.1.2.2 Update momentum network
The momentum network plays a critical role in our training strategy, and its update
process varies between teacher training and student training phases.
During teacher training, we update the momentum network using a momentum
parameter of 0.9. This lower momentum ratio compared to student training is
intentional. It ensures that the momentum network closely tracks the high-quality information learned from the labeled data while maintaining stability, allowing it to capture the global knowledge of the dataset.
In contrast, during student training, we update the momentum network with a
higher momentum of 0.99. This adjustment is made to strike a balance between re-
taining meaningful information from the teacher model, which updates more slowly,
and incorporating new features learned by the student model. This higher momen-
tum facilitates the transfer of relevant knowledge from the teacher model to the
student while adapting to the evolving features of the student model.
3.1.2.3 Loss function
In this section, we delve into the intricate details of the loss function employed for
training within our semi-supervised framework, tailored specifically for the challeng-
ing task of polyp segmentation. Our approach leverages the Tversky loss function, a
powerful tool that allows us to effectively balance the contributions of both labeled
and unlabeled datasets while ensuring that pseudo labels from the unlabeled data
dynamically contribute to the training process.
To achieve this fine-tuned balance, we utilize the Tversky loss for both the labeled
and unlabeled datasets. The Tversky loss offers a versatile solution, allowing us to
control the trade-off between false positives and false negatives, making it particu-
larly well-suited for medical image segmentation tasks, where the precise delineation
of regions of interest, such as polyps, is of paramount importance.
In the context of our semi-supervised learning framework, the unlabeled dataset
plays a pivotal role. However, as this dataset lacks ground truth annotations, we
employ pseudo labels generated by the momentum teacher network during the stu-
dent training process. These pseudo labels are iteratively updated as the student
model refines its segmentation predictions. This dynamic update mechanism en-
sures that the pseudo labels gradually become more reliable and accurate, aligning
with the model’s improved segmentation capabilities.
The total loss function for our semi-supervised training comprises two essential com-
ponents: the supervised loss and the unsupervised loss. The supervised loss quan-
tifies the dissimilarity between the predicted segmentation masks and the ground
truth labels for the labeled dataset, ensuring that the model accurately captures the
annotated polyp regions. It can be expressed as:

L_sup = 1 − (Σ_i p_i · g_i) / (Σ_i p_i · g_i + α · Σ_i p_i · (1 − g_i) + β · Σ_i (1 − p_i) · g_i)

Here, p_i represents the predicted probability of pixel i belonging to the polyp class, and g_i indicates the ground truth label (1 for polyp, 0 for non-polyp). The terms Σ_i p_i · (1 − g_i) and Σ_i (1 − p_i) · g_i are the soft false positives and false negatives, so the hyperparameters α and β control the balance between false positives and false negatives.
Simultaneously, the unsupervised loss quantifies the dissimilarity between the pseudo
labels (derived from the unlabeled dataset) and the student model’s predictions. It
encourages the model to generate more consistent and accurate pseudo labels over
time. The unsupervised loss can be defined as:

L_unsup = 1 − (Σ_i p_i · p̃_i) / (Σ_i p_i · p̃_i + α · Σ_i p_i · (1 − p̃_i) + β · Σ_i (1 − p_i) · p̃_i)

Here, p_i represents the predicted probability of pixel i belonging to the polyp class by the student model, and p̃_i represents the corresponding pseudo-label probability. The hyperparameters α and β continue to control the trade-off between false positives and false negatives, even in the unsupervised setting.
The total loss function in our semi-supervised learning framework plays a crucial
role in training our deep neural network for polyp segmentation. It’s designed to
strike a balance between the labeled data, which provides ground truth information,
and the unlabeled data, for which pseudo labels are generated iteratively during
training.
The total loss is calculated as the sum of two components: the supervised loss, which
quantifies the error between the predicted and ground truth labels for the labeled
data, and the unsupervised loss, which assesses the consistency between the model’s
predictions and the pseudo labels assigned to the unlabeled data.
L_total = L_sup + α · L_unsup
By setting α to 0.5, we ensure an equal contribution from both supervised and
unsupervised losses, emphasizing the importance of both labeled and unlabeled data
in training. This balanced approach helps the model generalize better, making it
more robust and capable of segmenting polyps accurately in medical images.
The combination of supervised and unsupervised losses in our total loss function
strikes a harmonious balance between leveraging the ground truth information from
the labeled dataset and the dynamically generated, refined pseudo labels from the
unlabeled dataset. This approach harnesses the collective knowledge from both
datasets to enhance the overall segmentation performance in polyp segmentation,
ultimately achieving robust and highly accurate results.
3.2 Mixed Momentum Model Committee - M3C Polyp
In this section, we introduce our novel approach, the Mixed Momentum Model Committee for Polyp Segmentation (M3C Polyp). M3C Polyp leverages a combination of labeled (D_l) and unlabeled (D_u) data to enhance the accuracy and robustness of polyp segmentation. Labeled data is represented as (x_l, y_l) ∈ D_l, while unlabeled data is denoted as x_u ∈ D_u.
3.2.1 Overall Pipeline
Our proposed methodology follows a two-step pipeline, building upon our prior work
[22]. The first step, illustrated in Figure 3.2, involves training the teacher model
with labeled data using a supervised loss and forming the Mixed Momentum Model
Committee.
During this initial step, we employ weak data augmentation techniques and utilize
binary cross-entropy loss tailored for segmentation tasks. Importantly, we simulta-
neously update the momentum models in the committee using different momentum
coefficients. This results in both the teacher model and the M3C committee, which
are critical components for the subsequent step.
Figure 3.2: Training the Teacher Model with labeled data using supervised loss and
creating the Mixed Momentum Model Committee.
The second step, depicted in Figure 3.3, employs an online pseudo-labeling strategy
[22]. Here, we train the student model while concurrently updating the teacher
model via Exponential Moving Average (EMA) with a momentum coefficient of
0.99. We also execute the M3C update step using the same momentum coefficients.
However, in this step, we introduce an uncertainty score calculation based on the
M3C model outputs. This uncertainty score is utilized in a consistency regularization
loss, with higher weight given to images exhibiting lower uncertainty, i.e., higher
confidence.
Figure 3.3: Training the Semi-Supervised Model with Online Pseudo Labeling and
Uncertainty Estimation via Mixed Momentum Model Committee - M3C.
The combination of these two steps, involving the teacher model, M3C committee,
and uncertainty estimation, empowers our pipeline to enhance the accuracy and
robustness of semi-supervised learning for polyp segmentation. By dynamically
adapting to both labeled and unlabeled data, M3C Polyp demonstrates a promising
approach for improving the segmentation of polyps in medical images.
3.2.2 Mixed Momentum Model Committee (M3C)
The Mixed Momentum Model Committee (M3C) is a critical component of
our novel semi-supervised polyp segmentation approach, aimed at maximizing the
robustness and accuracy of the model. Drawing inspiration from the demonstrated
effectiveness of momentum models in generating stable pseudo-labels, we introduce
the M3C as an ensemble of momentum models to further bolster the performance
of our semi-supervised learning framework.
3.2.2.1 Formation of the M3C
The M3C is constructed by creating a committee of K momentum models, each
derived from the teacher model T . This formation occurs during the initial phase of
training. As we train the teacher model with labeled data, the weights of the mo-
mentum models in the committee are updated using distinct momentum coefficients.
Specifically, we employ the Exponential Moving Average (EMA) technique to ad-
just the weights of the teacher model after each epoch, resulting in multiple versions
of the teacher model, each with varying momentum coefficients. Consequently, the
M3C embodies an ensemble of models, each offering unique perspectives and insights
into the data, effectively capturing various uncertainties.
Mathematically, the M3C formation can be represented as:
M3C = {M_1, M_2, . . . , M_K}

where each M_k is a momentum model derived from the teacher model T with a specific momentum coefficient.
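The committee formation and its per-member EMA updates can be sketched as follows. The weights are flattened into a single NumPy vector for illustration, and the momentum coefficients shown are assumed example values; the text does not list the exact coefficients used.

```python
import numpy as np

def init_committee(teacher_weights, momenta):
    """Create K momentum copies of the teacher, one per coefficient mu_k."""
    return [{"mu": mu, "w": teacher_weights.copy()} for mu in momenta]

def update_committee(committee, teacher_weights):
    """EMA update of each member: w_k <- mu_k * w_k + (1 - mu_k) * teacher."""
    for m in committee:
        m["w"] = m["mu"] * m["w"] + (1.0 - m["mu"]) * teacher_weights

# Example: a committee of K = 3 momentum models (coefficients are illustrative).
teacher = np.zeros(4)
committee = init_committee(teacher, momenta=[0.9, 0.95, 0.99])
teacher = np.ones(4)            # pretend one training epoch moved the teacher
update_committee(committee, teacher)
# Members with a smaller mu track the teacher faster.
```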
3.2.2.2 Integration of M3C in Semi-Supervised Learning
In the second phase of our approach, the M3C comes into play in conjunction with
the online pseudo labeling strategy [22] and uncertainty estimation. While training
the student model, we continuously update the teacher model’s weights using EMA,
with a consistent momentum coefficient of 0.99, denoted as µ. Simultaneously, the
M3C undergoes an update step with the same momentum coefficients applied during
its formation.
Mathematically, the EMA update of the teacher model T is given by:
θ_T ← µ · θ_T + (1 − µ) · θ_S

where θ_T and θ_S represent the weights of the teacher and student models, respectively.
3.2.2.3 Uncertainty Estimation
Notably, besides the primary role of generating pseudo-labels, the M3C is leveraged
to calculate an uncertainty score, denoted as U(x), for each image x based on the
collective outputs of the models within the committee. This uncertainty score is a
pivotal factor in our approach. It acts as a guiding signal for the consistency regular-
ization loss, offering a means to weigh the importance of each training sample based
on its degree of uncertainty. This approach prioritizes more confident predictions
during training, contributing to enhanced model robustness.
Mathematically, the uncertainty score U(x) can be expressed as:
U(x) = f(M3C(x))
Where f is a function that aggregates the outputs of the models in the M3C for
image x.
3.2.2.4 Enhanced Robustness and Accuracy
By integrating the M3C into our semi-supervised pipeline, we harness the collective
power of multiple momentum models. This ensemble-based strategy, coupled with
uncertainty estimation, empowers our model to produce stable pseudo-labels while
effectively utilizing unlabeled data. The M3C enhances the adaptability of our
model, improves accuracy, and bolsters generalization capabilities, making our semi-
supervised polyp segmentation framework a robust and reliable solution.
3.2.3 Uncertainty Estimation Based on M3C
In our innovative approach to semi-supervised polyp segmentation, we leverage the
Mixed Momentum Model Committee (M3C) as a valuable tool for estimating the
uncertainty associated with the model’s predictions. This uncertainty estimation
is crucial for understanding the model’s confidence and reliability, particularly in
situations where the model encounters challenging or ambiguous data.
3.2.3.1 Monte Carlo Dropout-Inspired Uncertainty Estimation
Inspired by the Monte Carlo Dropout technique [32], we harness the ensemble of
models within the M3C to perform uncertainty estimation. The fundamental idea
is to assess the level of uncertainty for each pixel in an image by utilizing multiple
predictions generated by the ensemble. This process offers a more comprehensive
view of the model’s decision-making process and can be highly informative in cases
where the model encounters complex or uncertain scenarios.
3.2.3.2 Ensemble Prediction with M3C
To start the uncertainty estimation process, we pass an input image x through each
model in the M3C committee, producing multiple predictions. These predictions
are then averaged across the models to create an ensemble prediction denoted as
M3C(x). The ensemble prediction represents a consensus view of the M3C commit-
tee and is formulated as follows:
M3C(x) = (1/K) · Σ_{k=1}^{K} M_k(x)

In this equation, x signifies the input image, K is the number of models within the M3C committee, and M_k(x) represents the prediction of the k-th model in the committee for the image x.
3.2.3.3 Entropy-Based Uncertainty Score
With the ensemble prediction M3C(x) in hand, we proceed to calculate the pixel-
wise uncertainty score, denoted as U(x), which serves as an indicator of the uncer-
tainty or confidence associated with the segmentation result for each pixel. This
uncertainty score is computed based on the concept of entropy, a widely used metric
for measuring uncertainty in probability distributions. The equation for calculating
U(x) is as follows:
U(x) = − Σ_{c=1}^{C} p(c|x) · log p(c|x)

Here, p(c|x) represents the probability of pixel x belonging to class c based on the ensemble prediction M3C(x), and C denotes the total number of classes in the segmentation task.
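The ensemble average and the entropy-based uncertainty score follow directly from these two formulas; the sketch below computes both in NumPy, with toy shapes and probability values as illustrative assumptions.

```python
import numpy as np

def ensemble_prediction(member_probs):
    """M3C(x): average the K members' per-pixel class probabilities.

    member_probs: array of shape (K, C, H, W) holding softmax outputs.
    """
    return member_probs.mean(axis=0)                      # shape (C, H, W)

def entropy_uncertainty(probs, eps=1e-12):
    """U(x): pixel-wise entropy of the ensemble distribution (C, H, W)."""
    return -np.sum(probs * np.log(probs + eps), axis=0)   # shape (H, W)

# Two members, two classes, a single 1x2 "image" (toy shapes for illustration).
members = np.array([
    [[[0.9, 0.5]], [[0.1, 0.5]]],   # member 1: confident on pixel 0
    [[[0.9, 0.5]], [[0.1, 0.5]]],   # member 2: agrees
])
m3c = ensemble_prediction(members)
u = entropy_uncertainty(m3c)
# Pixel 1 (a 50/50 split) carries higher uncertainty than pixel 0.
```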
3.2.3.4 Interpreting Uncertainty Scores
The uncertainty scores obtained through this process provide valuable insights into
the model’s behavior. Pixels with higher uncertainty scores indicate regions where
the model is less confident in its predictions or where there is significant variability
among the models within the M3C committee. These areas may correspond to
challenging or ambiguous image regions, and the associated uncertainty scores offer
a means to focus further attention or analysis.
3.2.4 Combine Loss for Effective Semi-Supervised Training
In this section, we introduce a comprehensive approach to semi-supervised learn-
ing by combining three distinct loss components. This combination of losses is a
cornerstone of our methodology, playing a pivotal role in enhancing the accuracy,
stability, and robustness of the training process. Each loss component serves a spe-
cific purpose, collectively contributing to the success of our semi-supervised polyp
segmentation framework.
3.2.4.1 Supervised Loss (L_sup)

The first loss component, denoted as L_sup, is applied to the labeled data. Its primary objective is to guide the model in making precise predictions on the labeled
samples. For our binary segmentation task, we utilize the Binary Cross-Entropy
loss, as defined below. This loss penalizes discrepancies between the model’s predic-
tions and the ground truth labels, effectively aligning the model’s outputs with the
provided supervision.
L_sup = −(1/N_sup) · Σ_{n=1}^{N_sup} [ y_n · log p(y_n|x_n) + (1 − y_n) · log(1 − p(y_n|x_n)) ]

Here, N_sup represents the number of labeled samples, x_n is the input data, y_n denotes the corresponding ground truth label, and p(y_n|x_n) is the predicted probability distribution over the binary labels.
3.2.4.2 Semi-Supervised Loss (L_semi)

The second loss component, L_semi, quantifies the dissimilarity between the pseudo-labels generated by the teacher model T and the predictions made by the student model on unlabeled data. This loss promotes consistency between the two sets of predictions and encourages the student model to align its outputs with the pseudo-labels provided by the teacher. It is mathematically expressed as:
L_semi = −(1/N_unsup) · Σ_{m=1}^{N_unsup} [ T(x_m) · log p(T(x_m)|x_m) + (1 − T(x_m)) · log(1 − p(T(x_m)|x_m)) ]

In this equation, N_unsup represents the number of unlabeled samples, x_m denotes the input data, T(x_m) represents the pseudo-label generated by the teacher model, and p(T(x_m)|x_m) is the predicted probability distribution over the pseudo-labels.
3.2.4.3 Consistency Regularization Loss (L_con)

The third loss component, L_con, introduces a consistency regularization mechanism into the training process. It enforces alignment between the predictions made by the student model and the ensemble output of the M3C committee, while taking the pixel-wise uncertainty score into account. This uncertainty-weighted consistency regularization loss is formulated as:
L_con = (1/N) · Σ_{i=1}^{N} U(x_i) · KL( p(M3C(x_i)|x_i) ‖ p(y_i|x_i) )

Here, N represents the total number of pixels, x_i is an individual input pixel, U(x_i) denotes the uncertainty score associated with the pixel, M3C(x_i) represents the ensemble prediction of the M3C committee, p(M3C(x_i)|x_i) is the probability distribution given by the ensemble, and p(y_i|x_i) is the probability distribution predicted by the student model.
3.2.4.4 Combining Losses for Holistic Training
By combining these three loss components, the model simultaneously optimizes supervised, semi-supervised, and consistency regularization objectives. The total loss, denoted as L_total, is expressed as:

L_total = L_sup + α · L_semi + β · L_con
In our experiments, we set α = 0.5 and β = 1.5 to balance the contributions of
each loss component. This combination of losses fosters accurate predictions on
labeled data, alignment between pseudo-labels and predictions on unlabeled data,
and consistency between the student model and the ensemble output of the M3C
committee. The result is an optimized and robust semi-supervised training process
that significantly improves the overall performance of polyp segmentation.
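A minimal NumPy sketch of the three-term objective is given below, assuming Bernoulli (binary) per-pixel distributions for the KL term and flattened probability arrays; the actual implementation operates on PyTorch tensors and network outputs, and the argument names are illustrative.

```python
import numpy as np

def bce(p, t, eps=1e-7):
    """Binary cross-entropy between targets t and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

def kl_binary(p, q, eps=1e-7):
    """Per-pixel KL(p || q) between Bernoulli distributions."""
    p = np.clip(p, eps, 1.0 - eps)
    q = np.clip(q, eps, 1.0 - eps)
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

def total_loss(student_lab, y, student_unlab, pseudo,
               student_cons, m3c_probs, u, alpha=0.5, beta=1.5):
    """L_total = L_sup + alpha * L_semi + beta * L_con (alpha=0.5, beta=1.5)."""
    l_sup = bce(student_lab, y)            # labeled data vs ground truth
    l_semi = bce(student_unlab, pseudo)    # student vs teacher pseudo-labels
    l_con = np.mean(u * kl_binary(m3c_probs, student_cons))  # uncertainty-weighted
    return l_sup + alpha * l_semi + beta * l_con
```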
Chapter 4
Experiments and Results
4.1 Datasets
The dataset utilized in this research is the Polyp dataset, consistent with the dataset employed in the HarDNet-MSEG study [18]. Its training portion consists of 1450 endoscopic images, each having dimensions of 384 × 288 × 3, accompanied by corresponding masks measuring 384 × 288 × 1. The masks in this dataset are binary images, where pixel values of 255 signify areas containing polyps, while pixel values of 0 represent background regions. The dataset is partitioned into distinct segments for training and testing. The training dataset comprises 900 images from the Kvasir-SEG [33] dataset and an additional 550 images from the CVC-ClinicDB [34] dataset. The testing dataset comprises 798 images synthesized from various sources, including Kvasir [33], CVC-ClinicDB [34], CVC-ColonDB [35], CVC-300 [36], and ETIS-LaribPolypDB [37].
Details of these datasets are described as follows:
Kvasir Dataset: The Kvasir dataset is an invaluable resource collected using
endoscopic equipment at Vestre Viken Health Trust (VV), Norway. Expert
gastroenterologists from VV and the Cancer Registry of Norway meticulously
annotated and verified its contents. Comprising 1000 images with varying
resolutions, ranging from 720 × 576 to 1920 × 1072 pixels, this dataset serves
as a critical component for assessing the performance of our proposed method.
CVC-ClinicDB Dataset: Derived from frames extracted from colonoscopy
videos, the CVC-ClinicDB dataset consists of 612 images, each with a reso-
lution of 384 × 288 pixels, sourced from 31 colonoscopy sequences. It played
a pivotal role in the training stages of the MICCAI 2015 Sub-Challenge on
Automatic Polyp Detection Challenge in Colonoscopy Videos, making it an
essential dataset for our research.
CVC-ColonDB Dataset: Provided by the Machine Vision Group (MVG),
the CVC-ColonDB dataset contributes 380 images with a resolution of 574 ×
500 pixels, originating from 15 short colonoscopy videos. Its content serves as
a valuable asset for our methodology evaluation.
CVC-300 Dataset: CVC-300 represents the test subset of the extensive
Endoscene dataset. It comprises 60 images extracted from 44 video sequences
obtained from 36 patients, adding to the diversity and complexity of the data
used in our research.
ETIS-Larib Dataset: The ETIS-Larib dataset includes 196 high-resolution
colonoscopy images, each boasting dimensions of 1226 × 996 pixels. This
dataset enhances our research’s capacity to generalize and perform effectively
across various data sources.
Figure 4.1: Examples from the five polyp segmentation datasets, showing an image and its corresponding ground truth mask from each dataset
Figure 4.1 showcases illustrative examples from five distinct datasets employed in
our research for polyp segmentation. These examples provide a visual representation
of the diversity and complexity inherent to these datasets, each contributing unique
challenges and characteristics to our study.
To facilitate rigorous experimentation and comprehensive evaluation, a validation
dataset is meticulously created by extracting 10% of the total dataset, which corre-
sponds to 145 images. This validation subset ensures the robustness and reliability
of our experiments. The remaining 1305 images in the dataset are allocated for con-
structing both labeled and unlabeled datasets for each experiment. These datasets
serve as the foundation for assessing the performance and efficacy of the proposed
method in this study, enabling us to draw meaningful comparisons with existing
approaches.
4.2 Data Processing
4.2.1 Data Pre-processing
A series of image processing steps are applied to the dataset to prepare it for training
and evaluation. First, all images are resized to a standardized resolution of 352 × 352
pixels to ensure consistency in input size for the deep neural networks used in the
experiments.
Next, a crucial preprocessing step involves normalizing the images to have zero
mean and unit variance, following the statistics of the ImageNet dataset. This
normalization ensures that the network can effectively learn features from the images
without being biased by varying intensity levels and color distributions.
For the ground truth masks, they are converted into binary format to facilitate
pixel-wise segmentation tasks. This conversion is achieved through two methods:
binary thresholding and the Otsu method. These techniques help in accurately
distinguishing the regions containing polyps from the background, providing precise
segmentation masks for training and evaluation.
To feed both the images and their corresponding masks into the deep learning model,
they are converted into tensor format. This tensor representation is essential for
efficient data handling during the training process, allowing for batch processing
and optimization techniques that enhance the learning process.
Overall, these image processing steps ensure that the dataset is appropriately pre-
pared for training deep neural networks for polyp segmentation tasks, enabling the
models to learn robust and accurate representations of polyp structures in endo-
scopic images.
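The normalization and mask-binarization steps can be sketched as follows. The resize step is omitted (it requires an image library), and the fixed threshold of 127 is an assumed stand-in for the thresholding/Otsu choice described above.

```python
import numpy as np

# Standard ImageNet channel statistics used for normalization.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize_image(img_uint8):
    """Scale an (H, W, 3) uint8 image to [0, 1], then apply ImageNet stats."""
    x = img_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

def binarize_mask(mask_uint8, threshold=127):
    """Convert a grayscale mask to {0, 1}: pixels of 255 become 1 (polyp)."""
    return (mask_uint8 > threshold).astype(np.uint8)
```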
4.2.2 Creating labeled and unlabeled data
In the context of this study, we detail the process of partitioning the initial dataset D_train, consisting of the 1305 labeled images obtained through the train/test/validation splits, into distinct subsets for semi-supervised learning experiments. Specifically, we establish a clear division between a labeled dataset (D_sup) and an unlabeled dataset (D_unsup).
First, we designate a certain percentage, such as 20%, 40%, or 60%, of the images from D_train to form D_sup, a labeled dataset that retains the ground truth masks. The remainder of D_train, represented as D_unsup, exclusively contains images, with the ground truth masks omitted. Mathematically, we define D_sup and D_unsup as follows:
|D_sup| = p_sup · |D_train|
|D_unsup| = (1 − p_sup) · |D_train|
Here, p_sup corresponds to the chosen percentage of data allocated to the labeled dataset D_sup, with 0 < p_sup ≤ 1, and D_train represents the entirety of the original labeled dataset. Figure 4.2 demonstrates the data separation step for different folds.
Figure 4.2: Example of our data separation strategy in different folds
This separation into labeled and unlabeled datasets serves as a critical component of
our semi-supervised learning approach, enabling us to assess the impact of different
proportions of labeled data on the performance of our polyp segmentation model.
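A minimal sketch of this split, assuming images are addressed by index and the labeled subset is drawn at random with a fixed seed:

```python
import numpy as np

def split_labeled_unlabeled(num_images, p_sup, seed=0):
    """Partition image indices into D_sup / D_unsup according to ratio p_sup."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_images)
    n_sup = int(round(p_sup * num_images))
    return idx[:n_sup], idx[n_sup:]       # labeled indices, unlabeled indices

# 20% of the 1305 training images kept as labeled data (261 / 1044 split).
sup_idx, unsup_idx = split_labeled_unlabeled(1305, p_sup=0.20)
```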
4.2.3 Data Augmentation
In this study, we employ a two-fold data augmentation strategy, distinguishing be-
tween the labeled and unlabeled datasets:
4.2.3.1 Weak Augmentation on Labeled Dataset
For the labeled dataset, we adopt a basic augmentation approach. This weak aug-
mentation strategy introduces a degree of randomness into the training images
through random flips, each with a probability of 0.5. The primary purpose of this
weak augmentation is to enhance the robustness of the model by exposing it to
minor variations in the labeled data. Figure 4.3 visualizes some weak augmentations on labeled images.
Figure 4.3: Weak augmentation in labeled images
4.2.3.2 Strong Augmentation on Unlabeled Dataset
In contrast, we apply a more extensive augmentation scheme to the unlabeled
dataset. This strong augmentation aims to introduce a controlled level of noise,
promoting invariance in the decision function applied to both labeled and unlabeled
data. This noise injection helps ensure that the student model, which receives strong
augmentation, remains consistent with the teacher model when generating pseudo
labels. The strong augmentation techniques include ShiftScaleRotate, RGBShift,
RandomBrightnessContrast, and RandomFlip, each applied with a probability of
0.5. This augmentation strategy contributes to the learning process by exposing
the model to a diverse range of image variations, effectively enhancing its ability
to generalize and perform well on unseen data. Figure 4.4 visualizes some strong augmentations on unlabeled images.
By differentiating between weak and strong augmentation strategies for labeled and
unlabeled datasets, we aim to strike a balance between robustness and model stabil-
ity, ultimately improving the overall performance of our polyp segmentation model.
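In practice these augmentations are applied with a transform library such as albumentations; the NumPy stand-in below only illustrates the weak-versus-strong distinction (flips applied to image and mask together, photometric jitter applied to the image alone). The jitter ranges are assumed values, not those used in the experiments.

```python
import numpy as np

def weak_augment(img, mask, rng):
    """Weak augmentation: random horizontal/vertical flips, each with p = 0.5."""
    if rng.random() < 0.5:
        img, mask = img[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:
        img, mask = img[::-1, :], mask[::-1, :]
    return img, mask

def strong_augment(img, mask, rng):
    """Strong augmentation: flips plus photometric jitter on the image only.

    The jitter below is a stand-in for RandomBrightnessContrast / RGBShift;
    geometric noise such as ShiftScaleRotate is omitted for brevity.
    """
    img, mask = weak_augment(img, mask, rng)
    img = img.astype(np.float32)
    if rng.random() < 0.5:                              # brightness/contrast jitter
        img = img * rng.uniform(0.8, 1.2) + rng.uniform(-20, 20)
    if rng.random() < 0.5:                              # per-channel RGB shift
        img = img + rng.uniform(-10, 10, size=(1, 1, 3))
    return np.clip(img, 0, 255), mask
```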
4.3 Evaluation Metrics
To assess the model’s performance and facilitate comparisons across different exper-
iments, we employ two key evaluation metrics: Mean Intersection over Union
(mIoU) and Mean Dice (mDice). These metrics provide quantitative insights into the accuracy and effectiveness of our polyp segmentation model.

Figure 4.4: Strong augmentation in unlabeled images
mIoU (Mean Intersection over Union) is defined as:

mIoU = TP / (TP + FP + FN)

where TP denotes True Positives, FP represents False Positives, and FN signifies False Negatives.

mDice (Mean Dice), another crucial metric, is defined as:

mDice = 2 · TP / (2 · TP + FP + FN)
These evaluation metrics, calculated based on the model’s predictions and ground
truth, provide a comprehensive measure of segmentation accuracy. By computing
mIoU and mDice, we can quantitatively assess the model’s ability to delineate polyp
regions accurately, ultimately enabling us to make informed comparisons across
different experiments and models.
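Both metrics follow directly from the pixel-wise confusion counts. The sketch below computes them for a single binary prediction; the reported mIoU and mDice then average such per-image scores over a test set (an assumption about the aggregation).

```python
import numpy as np

def confusion_counts(pred, gt):
    """TP / FP / FN counts for binary masks (1 = polyp)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return tp, fp, fn

def iou(pred, gt, eps=1e-7):
    tp, fp, fn = confusion_counts(pred, gt)
    return tp / (tp + fp + fn + eps)

def dice(pred, gt, eps=1e-7):
    tp, fp, fn = confusion_counts(pred, gt)
    return 2 * tp / (2 * tp + fp + fn + eps)

# Example: pred misses one of three polyp pixels and adds no false positives,
# so IoU = 2/3 and Dice = 4/5.
gt = np.array([[1, 1], [1, 0]])
pred = np.array([[1, 1], [0, 0]])
```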
4.4 Implementation Detail
In this section, we delve into the implementation details of our proposed method,
shedding light on the teacher training results achieved on the labeled dataset and
the intricacies of our semi-supervised learning approach, encompassing both offline
and online pseudo-labeling techniques. These experiments are conducted using both
the base model and its corresponding momentum network, allowing for a compre-
hensive comparison of their respective outcomes. To gain a deeper understanding of
the impact of labeled data quantity on our method’s efficacy, we further dissect the
dataset, exploring different ratios of labeled data utilization. Through this detailed
exploration, we aim to uncover valuable insights into the factors that influence the
performance of our approach, providing a comprehensive overview of our implemen-
tation methodology.
4.4.1 Training the teacher model in a supervised manner on labeled data
In the initial phase of our approach, we commence by training the model using
the labeled dataset. This training process involves optimizing the weights of the
original model using the Adam optimizer coupled with back-propagation algorithms.
Simultaneously, a slow copy of the original model is updated, with the momentum
ratio set at 0.95. Throughout this training phase, we meticulously track the model’s
performance on the validation set and save the best checkpoints for both the original
model and its corresponding momentum network.
Table 4.1 presents a performance comparison between the original teacher model
and its momentum network counterpart. Strikingly, our findings reveal that the
momentum model consistently outperforms the original model in various aspects,
demonstrating its superiority. As a result, we make the strategic decision to adopt
the momentum teacher model as the foundational component for the subsequent
semi-supervised experiments detailed in the forthcoming sections. This choice is
grounded in the compelling evidence of its superior performance, underscoring its
pivotal role in the success of our semi-supervised learning approach.
4.4.2 Training Student with Offline Pseudo Labeling (Semi-Supervised)
Our semi-supervised learning experiments are built upon the foundation of the mo-
mentum teacher model. The student model employed in this strategy shares an
identical architecture with the teacher model. In the offline pseudo-labeling ap-
proach, the generation of pseudo labels is directly derived from the momentum
teacher model. One critical characteristic of this method is that the momentum
teacher model remains static throughout the training process. Consequently, the
pseudo-labels assigned to the same image will remain consistent across all training
iterations, ensuring stability and consistency.
The student model is designed to accommodate both the original and momentum
versions of the teacher model, which have been trained on the labeled dataset. This
integration allows the student model to benefit from the knowledge transfer and
guidance provided by the momentum teacher, enabling it to produce more accurate
and robust results during semi-supervised training.
By adopting this offline pseudo-labeling strategy and capitalizing on the momen-
tum teacher model’s stability and expertise, our semi-supervised learning approach
demonstrates promising results in polyp segmentation. This strategy empowers the
student model to leverage the collective knowledge and expertise of the teacher
model, ultimately enhancing its ability to accurately segment polyps in medical
images.
4.4.3 Training Student with Online Pseudo Labeling (Semi-Supervised)
The online pseudo labeling strategy is founded on the same underlying training prin-
ciples as the offline labeling approach. However, there exists one crucial distinction:
the teacher model’s continuous evolution through the use of Exponential Moving
Average (EMA) in tandem with the student model’s weights after each training
epoch.
With online pseudo labeling, the pseudo label assigned to an image is not static;
instead, it undergoes continuous updates throughout the model training process.
This dynamic nature of pseudo labels ensures that they adapt and evolve along
with the evolving knowledge of both the teacher and student models. This approach
capitalizes on the principle that the teacher model’s guidance and expertise become
more refined as training progresses.
The incorporation of online learning, coupled with the utilization of the momen-
tum student model, has demonstrated superior performance compared to the offline
learning approach. Our experimental results, presented in Table 4.2, underscore the
effectiveness of online pseudo labeling in enhancing the student model’s segmenta-
tion accuracy.
Furthermore, to provide a visual representation of our results, Fig. 4.5 offers an
illustrative insight into the impact of online pseudo labeling on the quality of polyp
segmentation. The continuous updates and refinement of pseudo labels, coupled with
the guidance of the momentum teacher and student models, collectively contribute
to the improved performance of our semi-supervised learning approach in the domain
of polyp segmentation.
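The control flow of one online pseudo-labeling run can be sketched with a toy one-parameter logistic model standing in for the segmentation networks; everything except the three numbered steps (fresh pseudo-labels each epoch, a student update, and a teacher EMA update with µ = 0.99) is an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_pseudo_label_epochs(x_unlab, w_teacher, w_student,
                               epochs=3, lr=0.1, mu=0.99):
    """Toy online pseudo-labeling loop with a scalar-weight logistic 'model'.

    Each epoch: (1) the teacher produces fresh pseudo-labels, (2) the student
    takes a gradient step toward them, (3) the teacher is EMA-updated from the
    student with momentum mu. Real training uses deep networks; the scalar
    model here only demonstrates the control flow.
    """
    for _ in range(epochs):
        pseudo = (sigmoid(w_teacher * x_unlab) > 0.5).astype(float)   # step 1
        pred = sigmoid(w_student * x_unlab)
        grad = np.mean((pred - pseudo) * x_unlab)                     # BCE gradient
        w_student -= lr * grad                                        # step 2
        w_teacher = mu * w_teacher + (1.0 - mu) * w_student           # step 3
    return w_teacher, w_student
```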
4.4.4 System configuration
The experiments were carried out using a computer equipped with an Intel Core
i5-7500 CPU operating at 3.4GHz, 32GB of RAM, a GeForce GTX 1080 Ti GPU,
and a 1TB SSD hard disk. The model implementations were developed using the
PyTorch Lightning framework. This configuration provided the necessary computa-
tional resources to train and evaluate our models effectively while ensuring optimal
performance during experimentation and analysis.
4.5 Experimental results
4.5.1 Quantitative results
4.5.1.1 Effectiveness of momentum network
The comparison between the original teacher model and the momentum teacher model across various ratios of labeled data, as presented in Table 4.1, provides valuable insights into the effectiveness of the momentum network. In the table, the momentum model is denoted with a checkmark (✓) and the original model with a dash (-).
Table 4.1: A comparison of the original teacher with the momentum teacher model across different ratios of labeled data (each dataset cell reports mIoU / mDice)

Ratio of labeled data | Momentum teacher | CVC-ClinicDB | ETIS | CVC-ColonDB | CVC-300 | Kvasir | Average
20% | - | 0.794 / 0.856 | 0.618 / 0.698 | 0.625 / 0.702 | 0.803 / 0.875 | 0.819 / 0.883 | 0.732 / 0.803
20% | ✓ | 0.792 / 0.854 | 0.616 / 0.699 | 0.616 / 0.693 | 0.800 / 0.875 | 0.820 / 0.884 | 0.730 / 0.801
40% | - | 0.787 / 0.857 | 0.587 / 0.674 | 0.624 / 0.700 | 0.824 / 0.892 | 0.817 / 0.879 | 0.728 / 0.808
40% | ✓ | 0.803 / 0.868 | 0.616 / 0.694 | 0.642 / 0.716 | 0.831 / 0.898 | 0.819 / 0.876 | 0.743 / 0.811
60% | - | 0.835 / 0.889 | 0.610 / 0.677 | 0.645 / 0.721 | 0.820 / 0.885 | 0.842 / 0.894 | 0.751 / 0.814
60% | ✓ | 0.850 / 0.902 | 0.620 / 0.685 | 0.663 / 0.741 | 0.839 / 0.904 | 0.847 / 0.898 | 0.764 / 0.826
In the case of 20% labeled data, the momentum model performs on par with the original model in terms of mIoU and mDice, scoring marginally higher on some datasets (e.g., Kvasir) and marginally lower on others. This suggests that the momentum network preserves performance even with very limited labeled data.
As the ratio of labeled data increases to 40%, the performance gap between the two
models becomes more apparent. The momentum model consistently outperforms the
original model on all datasets, indicating that the momentum network’s stability and
knowledge retention play a crucial role in improving segmentation results.
When 60% of the data is labeled, the momentum model continues to exhibit its
superiority over the original model. The performance gap widens further across all
datasets, underscoring the momentum network’s ability to leverage a higher volume
of labeled data effectively.
These results clearly highlight the momentum model’s effectiveness in enhancing
semantic segmentation tasks, particularly when the amount of labeled data is lim-
ited. The stability and knowledge transfer mechanisms inherent to the momentum
network contribute significantly to improved segmentation accuracy, making it a
valuable asset in semi-supervised learning scenarios.
4.5.1.2 Effectiveness of online pseudo labeling
Table 4.2 provides a comprehensive comparison between the online pseudo-labeling
and offline pseudo-labeling strategies for semi-supervised training across different
ratios of labeled data. This analysis sheds light on the effectiveness of these two
approaches.
Table 4.2: A comparison of the online pseudo-labeling and offline pseudo-labeling strategies for semi-supervised training (each dataset cell reports mIoU / mDice)

Ratio of labeled data | Online pseudo-labels | Momentum student | CVC-ClinicDB | ETIS | CVC-ColonDB | CVC-300 | Kvasir | Average
20% | - | - | 0.789 / 0.847 | 0.590 / 0.670 | 0.647 / 0.727 | 0.821 / 0.891 | 0.832 / 0.891 | 0.736 / 0.805
20% | ✓ | - | 0.830 / 0.887 | 0.676 / 0.754 | 0.676 / 0.755 | 0.828 / 0.897 | 0.843 / 0.897 | 0.770 / 0.838
20% | - | ✓ | 0.801 / 0.857 | 0.673 / 0.748 | 0.641 / 0.717 | 0.830 / 0.897 | 0.835 / 0.895 | 0.756 / 0.823
20% | ✓ | ✓ | 0.816 / 0.870 | 0.701 / 0.772 | 0.669 / 0.742 | 0.836 / 0.903 | 0.856 / 0.911 | 0.778 / 0.841
40% | - | - | 0.792 / 0.850 | 0.628 / 0.704 | 0.650 / 0.733 | 0.833 / 0.902 | 0.813 / 0.875 | 0.743 / 0.813
40% | ✓ | - | 0.825 / 0.882 | 0.673 / 0.749 | 0.671 / 0.745 | 0.824 / 0.894 | 0.837 / 0.893 | 0.766 / 0.833
40% | - | ✓ | 0.824 / 0.883 | 0.601 / 0.671 | 0.668 / 0.750 | 0.829 / 0.896 | 0.838 / 0.899 | 0.752 / 0.820
40% | ✓ | ✓ | 0.825 / 0.882 | 0.702 / 0.777 | 0.689 / 0.768 | 0.825 / 0.895 | 0.850 / 0.908 | 0.778 / 0.846
60% | - | - | 0.833 / 0.888 | 0.617 / 0.686 | 0.651 / 0.725 | 0.802 / 0.871 | 0.842 / 0.899 | 0.749 / 0.814
60% | ✓ | - | 0.856 / 0.905 | 0.700 / 0.773 | 0.685 / 0.763 | 0.839 / 0.904 | 0.865 / 0.915 | 0.789 / 0.852
60% | - | ✓ | 0.832 / 0.881 | 0.652 / 0.724 | 0.674 / 0.752 | 0.819 / 0.880 | 0.856 / 0.910 | 0.767 / 0.830
60% | ✓ | ✓ | 0.855 / 0.901 | 0.694 / 0.772 | 0.701 / 0.777 | 0.833 / 0.898 | 0.865 / 0.916 | 0.790 / 0.853
In the scenario where only 20% of the data is labeled, we observe that both online
and offline pseudo-labeling strategies contribute to improved segmentation results
compared to the scenarios without pseudo-labels. However, the combination of online
pseudo-labeling and the momentum student stands out as the best performer
in terms of mIoU and mDice metrics across all datasets. This indicates that the
online strategy, with continuous updates to pseudo-labels during training, significantly
enhances the model's ability to leverage unlabeled data.
When the proportion of labeled data increases to 40%, a similar trend emerges,
with the combination of online pseudo-labeling and the momentum student consistently
outperforming the other configurations. The superiority of online pseudo-labeling
becomes even more evident as the labeled data ratio reaches 60%, where it yields the
highest mIoU and mDice scores on most datasets.
These results emphasize the substantial benefit of using online pseudo-labeling in
conjunction with a momentum student model. The continuous refinement of pseudo-
labels throughout training enables the model to adapt and refine its predictions,
ultimately leading to superior segmentation performance. Online pseudo-labeling
proves to be a powerful technique for harnessing the potential of unlabeled data in a
semi-supervised learning framework, demonstrating its effectiveness across varying
levels of labeled data availability.
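The difference between the two strategies can be sketched in a few lines: an offline scheme computes pseudo-masks once before training, whereas the online scheme regenerates them from the momentum teacher's current predictions at every iteration. The snippet below is a minimal NumPy illustration; the sigmoid squashing and the `threshold` value are assumptions standing in for the actual model configuration.

```python
import numpy as np

def online_pseudo_labels(teacher_logits, threshold=0.5):
    """Binarize the teacher's current predictions into hard pseudo-masks.

    Called inside the training loop, the pseudo-labels are refreshed every
    iteration as the momentum teacher improves; an offline scheme would
    instead compute them once and keep them fixed for all epochs.
    """
    probs = 1.0 / (1.0 + np.exp(-teacher_logits))  # sigmoid over logits
    return (probs > threshold).astype(np.float32)
```

Because the teacher itself is an EMA of the student, each refreshed batch of pseudo-masks reflects everything the student has learned so far, which is the adaptation the text above attributes to the online strategy.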
4.5.2 Comparison with Different Supervised Methods
Our model’s performance was rigorously assessed by benchmarking it against several
state-of-the-art supervised models, including UNet [14], UNet++ [15], SFA [38],
PraNet [39], MSNet [17], and Shallow Attention [40]. This comparative analysis
aimed to ascertain the efficacy of our semi-supervised approach in the context of
polyp segmentation.
Table 4.3: A comparison of our method with state-of-the-art supervised models

Method                     Ratio of      CVC-ClinicDB  ETIS          CVC-ColonDB   CVC-300       Kvasir
                           labeled data  mIoU   mDice  mIoU   mDice  mIoU   mDice  mIoU   mDice  mIoU   mDice
UNet [14]                  100%          0.755  0.823  0.335  0.398  0.444  0.512  0.627  0.710  0.746  0.818
UNet++ [15]                100%          0.729  0.794  0.344  0.401  0.410  0.483  0.624  0.707  0.743  0.821
SFA [38]                   100%          0.607  0.700  0.217  0.297  0.347  0.469  0.329  0.467  0.611  0.723
PraNet [39]                100%          0.849  0.899  0.567  0.628  0.640  0.709  0.797  0.871  0.840  0.898
MSNet [17]                 100%          0.879  0.921  0.664  0.719  0.678  0.755  0.807  0.869  0.862  0.907
Shallow Attention [40]     100%          0.859  0.916  0.654  0.750  0.670  0.753  0.815  0.888  0.847  0.904
Ours - OPL                 20%           0.816  0.870  0.702  0.777  0.669  0.743  0.836  0.904  0.856  0.912
Ours - OPL                 40%           0.825  0.883  0.702  0.777  0.690  0.757  0.825  0.895  0.850  0.909
Ours - OPL                 60%           0.855  0.902  0.694  0.772  0.701  0.767  0.833  0.899  0.865  0.916
Ours - M3C (w/ U weight)                 0.873  0.912  0.701  0.775  0.721  0.793  0.831  0.901  0.883  0.926
Ours - M3C (w/o U weight)                0.835  0.888  0.705  0.778  0.691  0.765  0.817  0.882  0.865  0.918
The performance results, measured in terms of both mIoU and mDice metrics,
are presented in Table 4.3. Our model’s performance is showcased across vari-
ous datasets, and the results are contrasted against those of the aforementioned
benchmark models.
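For reference, the two metrics reported throughout can be computed per image from a pair of binary masks as below; this is the standard formulation, with the smoothing constant `eps` an assumed implementation detail to avoid division by zero on empty masks.

```python
import numpy as np

def dice_iou(pred, gt, eps=1e-7):
    """Dice and IoU for a pair of binary masks.

    mDice and mIoU are these per-image scores averaged over a dataset.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()   # |pred AND gt|
    union = np.logical_or(pred, gt).sum()    # |pred OR gt|
    dice = (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)
```

Note that Dice is always at least as large as IoU for the same prediction, which is why the mDice columns consistently exceed the mIoU columns in the tables.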
Remarkably, our semi-supervised model exhibited superior performance to UNet,
UNet++, PraNet, SFA, and Shallow Attention across all datasets, as evidenced by
the Dice and IoU metrics. It is noteworthy that these outstanding results were
achieved while using only a maximum of 60% of labeled data for training. Our
model’s mDice score lagged behind MSNet by approximately 2% on the CVC-ClinicDB
dataset, yet it showcased a significant advantage across the remaining
datasets. This remarkable performance indicates the superior generalization capa-
bilities of our approach.
Most impressively, our model excelled when subjected to out-of-distribution datasets,
such as ETIS-LaribPolypDB, CVC-300, and CVC-ColonDB. These results under-
score the robustness and adaptability of our semi-supervised method, demonstrating
its ability to outperform established supervised methods in challenging and diverse
scenarios. Our approach represents a significant advancement in the domain of
polyp segmentation, offering state-of-the-art results and the potential for broader
applications in medical image analysis.
4.5.3 Qualitative results
4.5.3.1 Comparison between offline and online pseudo labeling
In the comparison between online pseudo-labeling and offline pseudo-labeling, along
with the incorporation of momentum networks, several qualitative observations
emerge from the results, as depicted in Figure 4.5.
Figure 4.5: Qualitative comparison between offline pseudo-labeling and online
pseudo-labeling, with and without the momentum network.
Based on Figure 4.5, we draw the following conclusions:

- The use of online pseudo-labeling consistently yields more favorable results
compared to offline pseudo-labeling across all datasets and labeled data ratios.
Notably, online pseudo-labeling generates masks that closely resemble the
ground truth annotations, whereas the masks produced by offline pseudo-labeling
exhibit significant discrepancies from the ground truth. This indicates
that online pseudo-labeling enables the model to produce more accurate and
visually faithful segmentations, as evident in the second row of the figure.

- The combination of momentum networks with online pseudo-labeling leads to
a remarkable improvement in segmentation quality.

- By introducing the momentum network, the model achieves more stable
predictions, effectively removing unnecessary details and noise from the
segmentations. The incorporation of momentum consistently produces smoother
and more coherent masks, as highlighted in the second row of the figure.
In summary, the synergy between momentum networks and online pseudo-labeling
significantly enhances the segmentation performance. The online strategy not only
produces masks that closely match the ground truth but also adapts continuously,
resulting in refined predictions. Meanwhile, the momentum network contributes to
the model’s stability, reducing unwanted artifacts in the segmentations. This com-
bined approach proves to be highly effective in harnessing the potential of unlabeled
data, ultimately leading to superior and more reliable results in the semi-supervised
learning framework.
4.5.3.2 Comparison with Different Supervised Methods
The qualitative analysis of our proposed method in comparison to traditional su-
pervised approaches, even when utilizing only 60% of the labeled data, reveals re-
markable advantages that underscore the effectiveness of semi-supervised learning.
Figure 4.6: Comparison with Different Supervised Methods
In Figure 4.6, it is evident that our proposed method excels in producing superior
results compared to its fully supervised counterparts. Despite employing a smaller
fraction of labeled data, our method exhibits a clear focus on the regions containing
polyps within the images. This is particularly noticeable in the uncertainty mask
displayed in the last column of the figure, where our method adeptly identifies
and highlights these critical areas. Moreover, in numerous instances, our proposed
method successfully detects even small polyps, a feat that traditional supervised
methods often struggle to achieve. A prime example can be observed in the first
row of the figure, where our method accurately identifies and segments a small polyp,
whereas the supervised methods fail to do so.
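The uncertainty mask shown in the last column of Figure 4.6 is obtained from several stochastic forward passes in the spirit of Monte Carlo dropout [32]. A minimal sketch follows, with `stochastic_forward` standing in for a segmentation model whose dropout layers remain active at inference time; the number of samples is an illustrative choice.

```python
import numpy as np

def mc_uncertainty(stochastic_forward, x, n_samples=8):
    """Per-pixel mean prediction and uncertainty over stochastic passes.

    High standard deviation marks pixels where the model disagrees with
    itself across passes -- the regions highlighted in the uncertainty mask.
    """
    preds = np.stack([stochastic_forward(x) for _ in range(n_samples)])
    return preds.mean(axis=0), preds.std(axis=0)
```

A deterministic model would give zero uncertainty everywhere; it is the retained dropout randomness that makes the per-pixel spread informative.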
Moving on to Figure 4.7, the strength of our semi-supervised approach becomes
even more evident when compared to its supervised counterparts, all while utilizing
the same model architecture. The results obtained through Grad-CAM visualiza-
tion distinctly illustrate the precision of our semi-supervised method in identifying
regions containing polyps. This precision stands in stark contrast to the supervised
learning counterparts, which exhibit less focused attention on these critical regions.
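The Grad-CAM heatmaps referred to above follow the standard recipe: each channel of the segmentation head's feature map is weighted by its spatially averaged gradient, the weighted maps are summed, and the result is passed through a ReLU and normalized. A framework-agnostic sketch, assuming channel-first `(C, H, W)` arrays:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from (C, H, W) activations and matching gradients."""
    weights = gradients.mean(axis=(1, 2))              # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)   # weighted channel sum -> (H, W)
    cam = np.maximum(cam, 0.0)                         # keep positive evidence only
    return cam / cam.max() if cam.max() > 0 else cam   # normalize to [0, 1]
```

The normalized map is then upsampled to the input resolution and overlaid on the image, producing the visualizations in Figure 4.7.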
Figure 4.7: Effectiveness of the semi-supervised method on the ETIS-LaribPolypDB
dataset, which is out-of-domain with respect to the training data. Each column shows
a feature map and binary mask of the supervised model (trained on the complete
labeled data) and the semi-supervised model (trained on 20% of the labeled data,
with the remainder unlabeled): (a) and (b) are the input image and the corresponding
ground truth; (c) and (d) are the Grad-CAM visualization of the segmentation head
and the output binary mask of the supervised model; (e) and (f) are the Grad-CAM
visualization of the segmentation head and the output binary mask of the
semi-supervised model.
In summary, our semi-supervised approach showcases its prowess by outperforming
traditional supervised methods, even with a reduced labeled dataset. It excels in
pinpointing polyp regions and captures even small polyps, which elude
many fully supervised techniques. The effectiveness of our approach becomes in-
creasingly apparent when contrasted with conventional supervised learning, as it
consistently exhibits a more accurate and targeted focus on polyp regions during
the segmentation process.
4.5.4 Generalization of our proposed method
The ability of our proposed method to generalize across various datasets is a crit-
ical aspect of its effectiveness. We evaluate its generalization on both in-domain
data, represented by Kvasir and CVC-ClinicDB, and out-of-domain data, including
CVC-300, CVC-ColonDB, and ETIS. Two key qualitative observations are presented
below:
4.5.4.1 In-Domain Data Evaluation
In Figure 4.8, we compare the performance of our proposed method with super-
vised methods on in-domain datasets. The results demonstrate that our method
consistently achieves comparable or even superior segmentation results compared to
traditional supervised approaches. In several instances, our approach exhibits no-
tably precise delineation of polyps without any instances of excessive segmentation.
This showcases the robustness and effectiveness of our proposed method in handling
in-domain data, ensuring accurate and consistent polyp segmentation.
Figure 4.8: Performance of our proposed method and different supervised methods
on in-domain data
4.5.4.2 Out-of-Domain Data Evaluation
Figure 4.9 illustrates the performance of our proposed method in comparison to
supervised methods on out-of-domain datasets. The outcomes highlight the remark-
able generalization capabilities of our approach when faced with data from diverse
sources. While some supervised methods may fail to detect or inaccurately segment
polyps, our proposed method consistently performs well. It excels in accurately de-
lineating polyp regions, even in the face of data that differs significantly from the
training domain. This substantial improvement in generalization underscores the
strength of leveraging semi-supervised learning and the momentum network, en-
abling our model to achieve superior generalization across a wide range of datasets.
Figure 4.9: Performance of our proposed method and different supervised methods
on out-of-domain data
In summary, our proposed method exhibits exceptional generalization capabilities.
It delivers consistent and precise results on both in-domain and out-of-domain
datasets, outperforming traditional supervised methods. This remarkable adapt-
ability is attributed to the synergy of semi-supervised learning and the momentum
network, which enhances the model’s ability to generalize effectively and maintain
accurate polyp segmentation across diverse data sources.
Chapter 5
Conclusion and future work
5.1 Conclusion
In this thesis, we have presented a novel approach to polyp segmentation in
colonoscopy images using semi-supervised learning with momentum networks. Our
proposed method, named Online Pseudo Labeling with Momentum Networks,
leverages a limited amount of labeled data in combination with a larger pool of
unlabeled data, achieving remarkable results that outperform traditional supervised
approaches.
Our experiments have demonstrated the effectiveness of our approach in different
scenarios. First, we analyzed the impact of different ratios of labeled to unlabeled
data, showing that our method consistently outperforms supervised methods even
when utilizing only 60% of the labeled data. This highlights the potential of semi-
supervised learning in medical image segmentation tasks, where obtaining large
labeled datasets can be challenging and costly.
Furthermore, we introduced the concept of momentum networks to enhance the sta-
bility and accuracy of our semi-supervised model. The momentum network helped
the student model focus on crucial details and reduced noise in predictions. This
technique not only improved the quality of segmentation masks but also increased
the model’s ability to generalize across different datasets, as shown in our experi-
ments on both in-domain and out-of-domain data.
The comparison with state-of-the-art supervised models revealed the superior per-
formance of our proposed method. Even when trained on a fraction of the labeled
data, our approach achieved better Dice and IoU scores, surpassing these models in
various metrics. Notably, our method excelled in detecting small and challenging
polyps, showcasing its potential for clinical applications where the accurate identi-
fication of polyps is critical.
In addition to quantitative results, our qualitative analysis provided further insights
into the advantages of our approach. The visualizations demonstrated that our
method produces segmentation masks that closely match the ground truth, effec-
tively capturing polyp boundaries. The robustness and generalization capabilities
of our method were evident when applied to out-of-domain datasets, where it out-
performed supervised models that struggled to identify and delineate polyps.
In conclusion, our thesis introduces a powerful approach to polyp segmentation
in colonoscopy images. By harnessing semi-supervised learning and momentum
networks, we have achieved remarkable results, surpassing traditional supervised
methods in accuracy, robustness, and generalization. This work has the potential to
significantly impact the field of medical image analysis, offering a promising solution
for improving polyp detection and diagnosis in real-world clinical settings. We look
forward to further research and development in this direction, with the ultimate goal
of enhancing healthcare outcomes through advanced image analysis techniques.
5.2 Future work
While our thesis has made significant strides in the field of polyp segmentation,
there are several promising avenues for future research and improvement.
- Semi-supervised Learning Variants: Investigating other semi-supervised
learning variants, such as consistency regularization or self-training, could pro-
vide additional insights and potentially improve segmentation results further.
- Data Augmentation Techniques: Exploring more advanced data augmentation
methods tailored specifically to colonoscopy images could enhance the
model’s ability to generalize across diverse datasets. This could involve tech-
niques such as domain-specific data augmentation or data synthesis to create
additional training samples.
- Real-time Deployment: Adapting our model for real-time or near-real-time
deployment during actual colonoscopy procedures would be a significant
advancement. This would involve optimizing the model for efficiency and
exploring hardware acceleration options.
- Integration into Clinical Workflow: Integrating our segmentation model
into existing clinical workflow software and electronic health record systems
could streamline the diagnostic process and assist healthcare providers.
- Cross-Domain Application: Our semi-supervised approach, combined with
momentum networks, can be adapted and applied to various domains beyond
polyp segmentation. Researchers can explore its applicability in other medi-
cal imaging tasks, such as the detection and segmentation of tumors, lesions,
or anomalies in different organs and modalities, including X-rays, MRIs, CT
scans, and histopathological images. Moreover, this approach can extend be-
yond medical imaging to tackle segmentation tasks in computer vision, re-
mote sensing, and industrial quality control, among others. Evaluating the
model’s performance and generalization capabilities across diverse datasets in
various domains will be an essential step in demonstrating its versatility and
robustness. By venturing into cross-domain applications, we can leverage the
potential of our method to contribute to a wide range of fields, enhancing
automation, accuracy, and efficiency in critical image analysis tasks across
industries. This extension of our research can lead to the development of
adaptable and reliable semi-supervised segmentation models with real-world
impact in multiple domains.
In summary, the field of medical image analysis continues to evolve, and our work
presents a foundation for future research and development in polyp segmentation.
By addressing these future directions, we can work towards improving patient care,
early detection, and more accurate diagnosis of gastrointestinal diseases through
advanced computer vision techniques.
References
[1] S. Hosseinzadeh Kassani, P. Hosseinzadeh Kassani, M. J. Wesolowski, K. A.
Schneider, and R. Deters, “Automatic polyp segmentation using convolutional
neural networks,” in Advances in Artificial Intelligence: 33rd Canadian Confer-
ence on Artificial Intelligence, Canadian AI 2020, Ottawa, ON, Canada, May
13–15, 2020, Proceedings 33, pp. 290–301, Springer, 2020.
[2] J. López Serrano, “Semi-supervised learning for semantic segmentation,” B.S.
thesis, Universitat Politècnica de Catalunya, 2021.
[3] J. Kang and J. Gwak, “Ensemble of instance segmentation models for polyp
segmentation in colonoscopy images,” IEEE Access, vol. 7, pp. 26440–26447,
2019.
[4] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-
averaged consistency targets improve semi-supervised deep learning results,”
Advances in neural information processing systems, vol. 30, 2017.
[5] B. H. Ngo, J. H. Park, S. J. Park, and S. I. Cho, “Semi-supervised domain
adaptation using explicit class-wise matching for domain-invariant and class-
discriminative feature learning,” IEEE Access, vol. 9, pp. 128467–128480, 2021.
[6] O. AlZoubi, S. K. Tawalbeh, and A.-S. Mohammad, “Affect detection from
arabic tweets using ensemble and deep learning techniques,” Journal of King
Saud University-Computer and Information Sciences, vol. 34, no. 6, pp. 2529–
2539, 2022.
[7] Y. Wang, H. Wang, Y. Shen, J. Fei, W. Li, G. Jin, L. Wu, R. Zhao, and X. Le,
“Semi-supervised semantic segmentation using unreliable pseudo-labels,” arXiv
preprint arXiv:2203.03884, 2022.
[8] Y. Zhang, Z. Gong, X. Zheng, X. Zhao, and W. Yao, “Semi-supervision seman-
tic segmentation with uncertainty-guided self cross supervision,” arXiv preprint
arXiv:2203.05118, 2022.
[9] Y. Li, G. W. P. Data, Y. Fu, Y. Hu, and V. A. Prisacariu, “Few-shot se-
mantic segmentation with self-supervision from pseudo-classes,” arXiv preprint
arXiv:2110.11742, 2021.
[10] N. Araslanov and S. Roth, “Self-supervised augmentation consistency for
adapting semantic segmentation,” in Proceedings of the IEEE/CVF CVPR,
pp. 15384–15394, 2021.
[11] M. Van Gerven and S. M. Bohte, “Artificial neural networks as models of neural
information processing,” 2017.
[12] M. W. Gardner and S. Dorling, “Artificial neural networks (the multilayer per-
ceptron)—a review of applications in the atmospheric sciences,” Atmospheric
environment, vol. 32, no. 14-15, pp. 2627–2636, 1998.
[13] K. O’Shea and R. Nash, “An introduction to convolutional neural networks,”
arXiv preprint arXiv:1511.08458, 2015.
[14] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for
biomedical image segmentation,” in International Conference on Medical image
computing and computer-assisted intervention, pp. 234–241, Springer, 2015.
[15] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: Re-
designing skip connections to exploit multiscale features in image segmenta-
tion,” IEEE transactions on medical imaging, vol. 39, no. 6, pp. 1856–1867,
2019.
[16] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Pranet:
Parallel reverse attention network for polyp segmentation,” in International
conference on medical image computing and computer-assisted intervention,
pp. 263–273, Springer, 2020.
[17] X. Zhao, L. Zhang, and H. Lu, “Automatic polyp segmentation via multi-scale
subtraction network,” in International Conference on Medical Image Comput-
ing and Computer-Assisted Intervention, pp. 120–130, Springer, 2021.
[18] C.-H. Huang, H.-Y. Wu, and Y.-L. Lin, “Hardnet-mseg: a simple encoder-
decoder polyp segmentation neural network that achieves over 0.9 mean dice
and 86 fps,” arXiv preprint arXiv:2101.07172, 2021.
[19] N. T. Duc, N. T. Oanh, N. T. Thuy, T. M. Triet, and D. V. Sang, “Colonformer:
An efficient transformer based method for colon polyp segmentation,” arXiv
preprint arXiv:2205.08473, 2022.
[20] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A.
Raffel, “Mixmatch: A holistic approach to semi-supervised learning,” Advances
in neural information processing systems, vol. 32, 2019.
[21] X. Luo, J. Chen, T. Song, and G. Wang, “Semi-supervised medical image seg-
mentation through dual-task consistency,” in Proceedings of the AAAI confer-
ence on artificial intelligence, vol. 35, pp. 8801–8809, 2021.
[22] T. P. Van, L. B. Doan, T. T. Nguyen, D. T. Tran, Q. Van Nguyen, and D. V.
Sang, “Online pseudo labeling for polyp segmentation with momentum net-
works,” in 2022 14th International Conference on Knowledge and Systems En-
gineering (KSE), pp. 1–6, IEEE, 2022.
[23] L. Yang, W. Zhuo, L. Qi, Y. Shi, and Y. Gao, “St++: Make self-training
work better for semi-supervised semantic segmentation,” in Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 4268–4277, 2022.
[24] S. Seferbekov, V. Iglovikov, A. Buslaev, and A. Shvets, “Feature pyramid net-
work for multi-class land segmentation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition Workshops, pp. 272–275, 2018.
[25] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer,
“Densenet: Implementing efficient convnet descriptor pyramids,” arXiv
preprint arXiv:1404.1869, 2014.
[26] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Fea-
ture pyramid networks for object detection,” in Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pp. 2117–2125, 2017.
[27] Z. Feng, Q. Zhou, G. Cheng, X. Tan, J. Shi, and L. Ma, “Semi-supervised se-
mantic segmentation via dynamic self-training and classbalanced curriculum,”
arXiv preprint arXiv:2004.08514, vol. 1, no. 2, p. 5, 2020.
[28] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D.
Cubuk, A. Kurakin, and C.-L. Li, “Fixmatch: Simplifying semi-supervised
learning with consistency and confidence,” Advances in Neural Information
Processing Systems, vol. 33, pp. 596–608, 2020.
[29] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, “Label propagation for deep
semi-supervised learning,” in Proceedings of the IEEE/CVF CVPR, pp. 5070–
5079, 2019.
[30] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton, “Big self-
supervised models are strong semi-supervised learners,” Advances in neural
information processing systems, vol. 33, pp. 22243–22255, 2020.
[31] N. Abraham and N. M. Khan, “A novel focal tversky loss function with im-
proved attention u-net for lesion segmentation,” in 2019 IEEE 16th interna-
tional symposium on biomedical imaging (ISBI 2019), pp. 683–687, IEEE, 2019.
[32] K. Miok, D. Nguyen-Doan, D. Zaharie, and M. Robnik-Šikonja, “Generating
data using monte carlo dropout,” in 2019 IEEE 15th International Conference
on Intelligent Computer Communication and Processing (ICCP), pp. 509–515,
IEEE, 2019.
[33] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen,
and H. D. Johansen, “Kvasir-seg: A segmented polyp dataset,” in MultiMedia
Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea,
January 5–8, 2020, Proceedings, Part II 26, pp. 451–462, Springer, 2020.
[34] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and
F. Vilariño, “WM-DOVA maps for accurate polyp highlighting in colonoscopy:
Validation vs. saliency maps from physicians,” Computerized medical imaging
and graphics, vol. 43, pp. 99–111, 2015.
[35] N. Tajbakhsh, S. R. Gurudu, and J. Liang, “Automated polyp detection in
colonoscopy videos using shape and context information,” IEEE transactions
on medical imaging, vol. 35, no. 2, pp. 630–644, 2015.
[36] D. Vázquez, J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, A. M. López,
A. Romero, M. Drozdzal, A. Courville, et al., “A benchmark for endoluminal
scene segmentation of colonoscopy images,” Journal of healthcare engineering,
vol. 2017, 2017.
[37] J. Silva, A. Histace, O. Romain, X. Dray, and B. Granado, “Toward embedded
detection of polyps in wce images for early diagnosis of colorectal cancer,”
International journal of computer assisted radiology and surgery, vol. 9, pp. 283–
293, 2014.
[38] Y. Fang, C. Chen, Y. Yuan, and K.-y. Tong, “Selective feature aggregation net-
work with area-boundary constraints for polyp segmentation,” in International
Conference on Medical Image Computing and Computer-Assisted Intervention,
pp. 302–310, Springer, 2019.
[39] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Pranet:
Parallel reverse attention network for polyp segmentation,” in International
Conference on Medical Image Computing and Computer-Assisted Intervention,
pp. 263–273, Springer, 2020.
[40] J. Wei, Y. Hu, R. Zhang, Z. Li, S. K. Zhou, and S. Cui, “Shallow attention
network for polyp segmentation,” in International Conference on Medical Image
Computing and Computer-Assisted Intervention, pp. 699–708, Springer, 2021.